Kernel Traffic
Latest | Archives | People | Topics
Home | News | RSS Feeds | Mailing Lists | Authors Info | Mirrors | Stalled Traffic

Git Traffic #1 For 2 May 2005

By Zack Brown

Table Of Contents


Welcome to the first issue of Git Traffic! After following along with the discussion for a couple of weeks, I realized I really wanted to summarize what was going on.

git has advanced at an unbelievable pace since Linus started work on it less than three weeks ago. For years, kernel development took place on BitKeeper, because no free alternative had emerged that could do what Linus needed. After the first three days of development, git already successfully hosted itself and the entire kernel tree.

Part of the reason for this is that Linus's experience with BitKeeper had taught him exactly what he needed and wanted, and he was able to focus on those things to the exclusion of all else. As you'll hear several times in the following text, he has no interest in creating a true version control system, only in creating something that will work for him.

Another reason for git's rapid development has been the sudden vast army of developers that poured into the project. Linus effectively stopped working on the kernel for a week or so (as he put it, the equivalent of taking a vacation), and many extremely powerful kernel hackers also turned their attention to this project. Many non-kernel hackers also piled on board, and the result is something that has to be seen to be believed. It seems clear that git, originally started as a temporary solution to an immediate problem, will very soon represent a great leap forward in revision control software.

Readers should be warned. Git, Cogito, and the various related tools that have cropped up are still fluctuating wildly. The basic ideas are also fairly unfamiliar to pretty much everyone. Even long-time kernel developers are having trouble figuring them out.

Bearing all that in mind, it is not the point of this newsletter to help anyone learn how to use git. In fact, the advice and pointers given by developers and summarized here should only be considered relevant to the state of the tools at those particular moments in time.

The point of Git Traffic is instead to convert the mass of mailing list traffic into a moderately sensible narrative, to give readers a glimpse into the development process as well as the current state of various debates, features, alternatives, and what-have-you.

So, with all these caveats aside, git development is utterly fascinating, and if you haven't been following it, you should certainly start.

1. Merging; Renames; Collisions; Alternatives; Cogito; And Speed

13 Apr 2005 - 27 Apr 2005 (144 posts) Subject: "Re: Re: Merge with git-pasky II."

Topics: BitKeeper, CVS, Collisions, Compression, Darcs, SHA1, Subversion, commit-id, commit-tree, diff-tree, read-tree, show-diff, show-files, update-cache, write-tree

People: Linus Torvalds, Junio C. Hamano, Petr Baudis, David Woodhouse

In the course of discussion (git discussion actually began on the linux-kernel mailing list, so here we jump in in the middle), Linus Torvalds said, "I've been avoiding doing the merge-tree thing, in the hope that somebody else does what I've described. I really do suck at scripting things, yet this is clearly something where using C to do a lot of the stuff is pointless. Almost all the parts do seem to be there, ie Daniel did the "common parent" part, and the rest really does seem to be more about scripting than writing more C plumbing stuff.." Junio C. Hamano replied:

I now have a Perl script that uses rev-tree, cat-file, diff-tree, show-files (with one modification so that it can deal with pathnames with embedded newlines), update-cache (with one modification so that I can add an entry for a file that does not exist to the dircache) and merge (from RCS). Quick and dirty.

The change to show-files is to give it an optional '-z' flag, which changes the record terminator to a NUL character instead of LF.

The script git-merge.perl takes two head commits. It basically follows what you described as I remember ;-):

  1. runs rev-tree with --edges to find the common ancestor.
  2. creates a temporary directory "./,,merge-temp"; creates a symlink ./,,merge-temp/.git/objects that points at .git/objects.
  3. sets up dircache there, initially populated with this common ancestor tree. No files are checked out. Just set up .git/index and that's it.
  4. runs diff-tree to find what has been changed in each head.
  5. for each path involved:
    1. if neither heads change it, leave it as is;
    2. if only one head changes a path and the other does not, just get the changed version;
    3. if both heads change it, check all three out and run merge.

It does not currently commit. You can go to ./,,merge-temp/ and run show-diff to see the result of the merge. Files added in one head have already had "update-cache" run on them when the script ends, but changed and merged files have not---dircache still has the common ancestor view. So the show-diff you will be seeing may be enormous and not very useful if the two forks were done in the distant past. After reviewing the merge result, you can update-cache, write-tree and commit-tree as usual, but with one caveat: do not run "show-files | xargs update-cache" if you are running git-merge.perl without the -f flag!

By default, git-merge.perl creates the absolute minimum number of files in ./,,merge-temp---only the merged files are left there so that you can inspect them. You will not see unmodified files nor files changed only by one side of the merge.

If you give '-o' (oneside checkout) flag to git-merge.perl, then the files only one side of the merge changed are also checked out in ./,,merge-temp. If you give '-f' (full checkout) flag to git-merge.perl, then in addition to what '-o' checks out, unchanged files are checked out in ./,,merge-temp. This default is geared towards a huge tree with small merges (favorite case of Linus, if I understand correctly).

Running 'show-diff' in such a sparsely populated merge result tree gives you huge results because recent show-diff shows diffs with empty files. I added a '-r' flag to show-diff, which squelches diffs with empty files.

Also, to implement 'changed only by one side' without actually checking the file out, I needed to add one option to 'update-cache'. The --cacheinfo flag is used this way:

$ update-cache --cacheinfo mode sha1 path

and adds the pathname with mode and sha1 to the .git/index without actually requiring you to have such a file there.
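The per-path logic of steps 5.1 through 5.3 is compact enough to sketch. The following is a hypothetical Python rendering of the decision, not Junio's actual Perl; the arguments are the blob ids a path has in each of the three trees:

```python
def merge_path(ancestor_id, head1_id, head2_id):
    """Decide what to do with one path in a three-way merge.

    Returns "keep", "take-head1", "take-head2", or "merge"; the last
    means: check all three versions out and run merge(1) on them.
    """
    if head1_id == ancestor_id and head2_id == ancestor_id:
        return "keep"        # 5.1: neither head changed it
    if head1_id == ancestor_id:
        return "take-head2"  # 5.2: only head2 changed it
    if head2_id == ancestor_id:
        return "take-head1"  # 5.2: only head1 changed it
    return "merge"           # 5.3: both heads changed it
```

Note that most paths in a big tree fall into the "keep" branch, which is why the default sparse checkout above works well for large trees with small merges.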

Linus was very pleased to see this, though he did add:

In the meantime I wrote a very stupid "merge-tree" which does things slightly differently, but I really think your approach (aka my original approach) is actually a lot faster. I was just starting to worry that the ball didn't start, so I wrote an even hackier one.

My really hacky one is called "merge-tree", and it really only merges one directory. For each entry in the directory it says either

select <mode> <sha1> path


merge <mode>-><mode>,<mode> <sha1>-><sha1>,<sha1> path

depending on whether it could directly select the right object or not.

It's actually exactly the same algorithm as the first one, but I was afraid the first one would be so abstract that it (a) might not work and (b) wouldn't get people to work it out. This "one directory at a time with very explicit output" thing is much more down-to-earth, but it's also likely slower because it will need script help more often.

That said, I don't know. MOST of the time there will be just a single "directory" entry that needs merging, and then the script would just need to recurse into that directory with the new "tree" objects. So it might not be too horrible.

But I'm really happy that you seem to have implemented my first suggestion and I seem to have been wasting my time.

He did point out one case Junio had missed in his path assessment: "if both heads change it to the same thing, take the new thing" . He added, "but maybe you counted that as" [if neither heads change it, leave it as is] "(it _should_ fall out automatically from the fact that "diff-tree" between the two destination trees shows no difference for such a file)." Junio replied:

Actually I am not handling that. It really is 5.1a---the exact same code path as 5.1 can be used for this case, and as you point out it is really a quite important optimization.

I have to handle the following cases. I think I currently do wrong things to them:

5.1a both heads modify it to the same thing.
5.1b one head removes, the other does not do anything.
5.1c both heads remove it.
5.3 one head removes, the other head modifies.

Handling of 5.1a, 5.1b and 5.1c is obvious.

5.1a Update dircache to the same new thing. Without -f or -o flag do not touch ,,merge-temp/. directory; with -f or -o, leave the new file in ,,merge-temp/.

5.1b Remove the path from dircache and do not have the file in ,,merge-temp/. directory regardless of -f or -o flags.

5.1c Same as 5.1b

I am not sure what to do with 5.3. My knee-jerk reaction is to leave the modified result in ,,merge-temp/$path~ without touching dircache. If the merger wants to pick it up, he can rename $path~ to $path temporarily, run show-diff on it (I think giving an option to show-diff to specify paths would be helpful for this workflow), to decide if he wants to keep the file or not. Suggestions?
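One way to render the full case table is to treat a removed path as None. This is a hypothetical Python sketch with illustrative names; the conservative "conflict" answer for 5.3 is an assumption here, not settled behavior:

```python
def decide_path(ancestor_id, head1_id, head2_id):
    """Blob ids per tree; None means the path is absent or removed."""
    if head1_id == head2_id:
        # Covers: unchanged, 5.1a (both modify to the same thing),
        # and 5.1c (both remove, so both ids are None).
        return ("take", head1_id)
    if head1_id == ancestor_id:
        return ("take", head2_id)   # only head2 changed or removed it (5.1b)
    if head2_id == ancestor_id:
        return ("take", head1_id)   # only head1 changed or removed it (5.1b)
    if head1_id is None or head2_id is None:
        return ("conflict", None)   # 5.3: removed on one side, modified on the other
    return ("merge", None)          # both modified differently: content merge
```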

Linus pointed out another possible case: "one side creates a file, and the other one creates a directory." Regarding what to do about case 5.3, he went on:

My very _strong_ preference is to just inform the user about a merge that cannot be performed, and not let it be automated. BIG warning, with some way for the user to specify the end result.

The thing is, these are pretty rare cases. But in order to make people feel good about the _common_ case, it's important that they feel safe about the rare one.

Put another way: if git tells me when it can't do something (with some specificity), I can then fix the situation up and try again. I might curse a while, and maybe it ends up being so common that I might even automate it, but at least I'll be able to trust the end result.

In contrast, if git does something that _may_ be nonsensical, then I'll worry all the time, and not trust git. That's much worse than an occasional curse.

So the rule should be: only merge when it's "obviously the right thing". If it's not obvious, the merge should _not_ try to guess what the right thing is. It's much better to fail loudly.

(That's especially true early on. There may be cases that end up being obvious after some usage. But I'd rather find them by having git be too stupid, than find out the hard way that git lost some data because it thought it was ok to remove a file that had been modified)

Junio posted a patch, saying:

Here is a diff to update the git-merge.perl script I showed you earlier today ;-). It contains the following updates against your HEAD (bb95843a5a0f397270819462812735ee29796fb4).

Petr Baudis said he'd been about to work on a similar patch, when Junio "outran" him. He added, "I'll change it to use the cool git-pasky stuff (commit-id etc) and its style of committing - that is, it will merely record the update-caches to be done upon commit, and it will read-tree the branch we are merging to instead of the ancestor. (So that git diff gives useful output.)" Junio replied, "Sorry, I have not seen what you have been doing since pasky 0.3, and I have not even started to understand the mental model of the world your tool is building. That said, my gut feeling is that telling this script about git-pasky's world model might be a mistake. I'd rather see you consider the script as mere "part of the plumbing"." Linus replied:

I agree. Having separate abstraction layers is good. I'm actually very happy with Pasky's cleaned-up-tree, exactly because unlike the first one, Pasky did a great job of maintaining the abstraction between "plumbing" and user interfaces.

The plumbing should take user interface needs into account, but the more conceptually separate it is ("does it make sense on its own?") the better off we'll be. And "merge these two trees" (which works on a _tree_ level) or "find the common commit" (which works on a _commit_ level) look like plumbing to me - the kind of things I should have written, if I weren't such a lazy slob.

The discussion diverged into a couple of tightly related subthreads at this point. In one of these, several posts down the line, Linus said:

I've not even been convinced that renames are worth it. Nobody has really given a good reason why.

There are two reasons for renames I can think of:

David Woodhouse replied, "Git doesn't actually look hard into the contents of a tree; certainly it has no business looking at the contents of individual files; that is something that the SCM or possibly only the user should do. The storage of 'rename' information in the commit object is another kind of 'xattr' storage which git would provide but not directly interpret. And you're right; it shouldn't have to be for renames only. There's no need for us to limit it to one "source" and one "destination"; the SCM can use it to track content as it sees fit." But Linus thought David was missing the point:

First off, let's just posit that "files" do not matter. The only thing that matters is how "content" moved in the tree. Ok? If I copy a function from one file to another, the perfect SCM will notice that, and show it as a diff that removes it from one file and adds it to another, and is _still_ able to track authorship past the move. Agreed?

Now, you basically propose to put that information in the "commit" log, and that's certainly valid. You can have the commit log say "lines 50-89 in file kernel/sched.c moved to lines 100-139 in kernel/timer.c", and then renames fall out of that as one very small special case.

You can even say "lines 50-89 in file kernel/sched.c copied to.." and allow data to be tracked past not just movement, but also duplication.

Do you agree that this is kind of what you'd want to aim for? That's a winning SCM concept.

How do you think the SCM _gets_ at this information? In particular, how are you proposing that we determine this, especially since 90% of all stuff comes in as patches etc?

You propose that we spend time when generating the tree on doing so. I'm telling you that that is wrong, for several reasons:

Now, look at my proposal:

The above tool is (a) fairly easy to write for git (if you can do visualization tools) and (b) _exactly_ what I think most programmers actually want. Tell me I'm wrong. Honestly..

And notice? My clearly _superior_ algorithm never needed any rename information at all. It would have been a total waste of time. It would also have hidden the _real_ pattern, which was that a piece of code was merged from several other matching pieces of code into one new helper function. But if it _had_ been a pure rename, my superior tool would have trivially found that _too_. So rename information really really doesn't matter.

So I'm claiming that any SCM that tries to track renames is fundamentally broken unless it does so for internal reasons (ie to allow efficient deltas), exactly because renames do not matter. They don't help you, and they aren't what you were interested in _anyway_.

What matters is finding "where did this come from", and the git architecture does that very well indeed - much better than anything else out there. I outlined a simple algorithm that can be fairly trivially coded up by somebody who really cares. Sure, pattern matching isn't trivial, but you start out with just saying "let's find that exact line, and two lines on each side", and then you start improving on that.

And that "where did this come from" decision should be done at _search_ time, not commit time. Because at that time it's not only trivial to do, but at that time you can _dynamically_ change your search criteria. For example, you can make the "match" algorithm be dependent on what you are looking at.

If it's C source code, it might want to ignore variable names when it searches for matching code. And if it's an OpenOffice document, you might have some open-office-specific tools to do so. See? Also, the person doing the searches can say whether he is interested in that particular line (or even that particular _identifier_ on a line), or whether he wants to see the changes "around" that line.

All of which are very valid things to do, and all of which my world-view supports very well indeed. And all of which your pitiful "files matter" world-view totally doesn't get at all.

In other words, I'm right. I'm always right, but sometimes I'm more right than other times. And dammit, when I say "files don't matter", I'm really really Right(tm).

Please stop this "track files" crap. Git tracks _exactly_ what matters, namely "collections of files". Nothing else is relevant, and even _thinking_ that it is relevant only limits your world-view. Notice how the notion of CVS "annotate" always inevitably ends up limiting how people use it. I think it's a totally useless piece of crap, and I've described something that I think is a million times more useful, and it all fell out _exactly_ because I'm not limiting my thinking to the wrong model of the world.
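Linus's proposed starting point, "find that exact line, and two lines on each side", can be sketched naively. This is an illustrative Python rendering with made-up names; a real tool would match fuzzily and score candidates rather than demand an exact window:

```python
def find_origin(source_lines, line_no, target_lines, context=2):
    """Search-time content tracking: take line `line_no` of
    source_lines plus `context` lines on each side, and look for that
    exact window in target_lines (another file's contents). Returns
    the start index of the match in target_lines, or None."""
    lo = max(0, line_no - context)
    window = source_lines[lo:line_no + context + 1]
    w = len(window)
    for i in range(len(target_lines) - w + 1):
        if target_lines[i:i + w] == window:
            return i
    return None
```

The key property is that the matching happens at search time, against whatever trees the user names, so the criteria can be changed per query without any rename metadata ever being committed.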

Further along in that subthread, it was suggested that the SHA1 hash, used by git to uniquely identify files, might allow collisions during normal use, i.e. two files with the same SHA1 key. There were various ideas of what other information could be used to help avoid these collisions, but Linus said:

Note that using anything that isn't data-related totally destroys the whole point of the object database. Remember: any time we don't uniquely generate the same name for the same object, we'll waste disk-space.

So adding in user/machine/uuid's to the thing is always a mistake. The whole thing depends on the hash being as close to 1:1 with the contents as humanly possible.

There's also the issue of size. Yes, I could have chosen sha256 instead of sha1. But the keys would be almost twice as big, which in turn means that the "tree" objects would be bigger, and that the "index" file would be bigger.

Is that a huge problem? No. We can certainly move to it if sha1 ever shows itself to be weak. But I really think we are much better off just re-generating the whole tree and history at that point, rather than try to predict the future.

The fact is, with current knowledge, sha1 _is_ safe for what git uses it for, for the foreseeable future. And we have a migration strategy if I'm wrong. Don't worry about it.

Almost all attacks on sha1 will depend on _replacing_ a file with a bogus new one. So guys, instead of using sha256 or going overboard, just make sure that when you synchronize, you NEVER import a file you already have.

It's really that simple. Add "--ignore-existing" to your rsync scripts, and you're pretty much done. That guarantees that a new evil blob by the next mad scientist out to take over the world will never touch your repository, and if we make this part of the _standard_ scripts, then dammit, security is in good _practices_ rather than just relying blindly on the hash being secure.

In other words, I think we could have used md5's as the hash, if we just make sure we have good practices. And it wouldn't have been "insecure".

The fact is, you don't merge with people you don't trust. If you don't trust them, they have a much easier time corrupting your repository by just creating bugs in the code and checking that thing in. Who cares about hash collisions, when you can generate a kernel root vulnerability by just adding a single line of code and use the _correct_ hash for it.

So the sha1 hash does not replace _trust_. That comes from something else altogether.
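The "--ignore-existing" rule generalizes to any object import: never let an incoming object replace one you already have. A hypothetical Python sketch of that rule, assuming a flat object directory for simplicity:

```python
import os
import shutil

def import_objects(src_dir, dst_dir):
    """Copy only objects we do not already have, so an existing
    object can never be replaced by a colliding newcomer. This is
    the same guarantee rsync --ignore-existing gives."""
    for name in os.listdir(src_dir):
        dst = os.path.join(dst_dir, name)
        if not os.path.exists(dst):  # never overwrite what we trust
            shutil.copyfile(os.path.join(src_dir, name), dst)
```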

Various folks, including Ingo Molnar, described the dangers of relying too heavily on SHA1. Ingo posted a security exploit involving a malicious hacker submitting a file that, by virtue of unusual formatting, would have the same key as another file already in the system. But Linus replied:

The fact is, a lot of _crap_ engineering gets done because of the question "what if?". It results in over-engineering, often to the point where the end result is quite a lot measurably worse than the sane results.

You are _literally_ arguing for the equivalent of "what if a meteorite hit my plane while it was in flight - maybe I should add three inches of high-tension armored steel around the plane, so that my passengers would be protected".

That's not engineering. That's five-year-olds discussing building their imaginary forts ("I want gun-turrets and a mechanical horse one mile high, and my command center is 5 miles under-ground and totally encased in 5 meters of lead").

I absolutely _hate_ doing engineering on the principle of "this might be possible in theory", and I'm violently opposed to it. So far, I have not heard a single argument that I consider even _remotely_ likely.

The thing is, even if you can force a hash collision by sending somebody a patch, it's really pretty much almost guaranteed that the patch is not just "a few strange characters", unless sha1 is really broken to the point where it's not cryptographically secure _at_all_.

In other words, unless somebody finds a way to make sha1 appear as nothing more than a complicated set of parity bits, all brute-force "get the same sha1" is likely to be about generating a really strange blob based on the thing you want to replace - and by "really strange" I mean total binary crap. And likely _much_ bigger too. And by "much bigger" I mean "possibly gigabytes of data".

And the thing is, _if_ somebody finds a way to make sha1 act as just a complex parity bit, and comes up with generating a clashing object that actually makes sense, then going to sha256 is likely pointless too - I think the algorithm is basically the same, just with more bits. If you've broken sha1 to the point where it's _that_ breakable, then you've likely broken sha256 too. Nobody has ever proven that you couldn't break sha256 with some really clever algorithm...

So if you start playing "what if?" games, dammit, I can play mine.

If we want to have any kind of confidence that the hash is really unbreakable, we should make it not just longer than 160 bits, we should make sure that it's two or more hashes, and that they are based on totally different principles.

And we should all digitally sign every single object too, and we should use 4096-bit PGP keys and unguessable passphrases that are at least 20 words in length. And we should then build a bunker 5 miles underground, encased in lead, so that somebody cannot flip a few bits with a ray-gun, and make us believe that the sha1's match when they don't. Oh, and we need to all wear aluminum propeller beanies to make sure that they don't use that ray-gun to make us do the modification _ourselves_.

And the thing is, that's just crazy talk. The difference between a crazy person and an intelligent one is that the crazy one doesn't realize what makes sense in the world. The goal of good engineering is not to ask "what if?", but to ask "how do I make this work as well as possible".

So please stop with the theoretical sha1 attacks. It is simply NOT TRUE that you can generate an object that looks halfway sane and still gets you the sha1 you want. Even the "breakage" doesn't actually do that. And if it ever _does_ become true, it will quite possibly be thanks to some technology that breaks other hashes too.

So until proven otherwise, I worry about accidental hashes, and in 160 bits of good hashing, that just isn't an issue either. Anybody who compares a 128-bit md5-sum to a 160-bit sha1 doesn't understand the math. It didn't get "slightly less likely" to happen. It got so _unbelievably_ less likely to happen that it's not even funny.
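The math being alluded to is the birthday bound. A quick illustrative computation (the billion-object count is just an assumption chosen for scale):

```python
import math

def accidental_collision_probability(hash_bits, n_objects):
    """Birthday bound: chance of at least one accidental collision
    among n_objects random hash_bits-bit values, via the standard
    approximation p = 1 - exp(-n^2 / 2^(bits+1)). Using expm1 keeps
    the tiny result from rounding to zero in floating point."""
    return -math.expm1(-(n_objects ** 2) / 2 ** (hash_bits + 1))

# A billion objects: md5 (128 bits) vs sha1 (160 bits). Each extra
# bit halves the odds, so 32 extra bits cut them by a factor of
# roughly four billion.
p_md5 = accidental_collision_probability(128, 10 ** 9)
p_sha1 = accidental_collision_probability(160, 10 ** 9)
```

Both numbers are astronomically small for accidental collisions; the point is that the sha1 figure is smaller again by about 2^32, which is why the comparison with md5 "didn't get slightly less likely".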

At around this point, Bram Cohen and Tom Lord posted some criticisms of Linus's whole approach, saying he was neglecting key issues that they had already solved. Linus said, "Git does in ~5000 lines and two weeks of work what _I_ think is the right thing to do. You're welcome to disagree, but the fact is, people have whined and moaned about my use of BK FOR THREE YEARS without showing me any better alternatives. So why are you complaining now, when I implement my own version in two weeks?" He added:

Btw, I've also been pretty disgusted by SCM's apparently generally caring about stuff that is totally not relevant.

For example, it seems like most SCM people think that merging is about getting the end result of two conflicting patches right.

In my opinion, that's the _least_ important part of a merge. Maybe the kernel is very unusual in this, but basically true _conflicts_ are not only rare, but they tend to be things you want a human to look at regardless.

The important part of a merge is not how it handles conflicts (which need to be verified by a human anyway if they are at all interesting), but that it should meld the history together right so that you have a new solid base for future merges.

In other words, the important part is the _trivial_ part: the naming of the parents, and keeping track of their relationship. Not the clashes.

For example, CVS gets this part totally wrong. Sure, it can merge the contents, but it totally ignores the important part, so once you've done a merge, you're pretty much up shit creek wrt any subsequent merges in any other direction. All the other CVS problems pale in comparison. Renames? Just a detail.

And it looks like 99% of SCM people seem to think that the solution to that is to be more clever about content merges. Which misses the point entirely.

Don't get me wrong: content merges are nice, but they are _gravy_. They are not important. You can do them manually if you have to. What's important is that once you _have_ done them (manually or automatically), the system had better be able to go on, knowing that they've been done.
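The "trivial part" being championed here, recording both parents at merge time, is exactly what makes a common ancestor computable for the next merge. A toy sketch, with a made-up dict encoding of the commit graph:

```python
def common_ancestors(parents, head_a, head_b):
    """parents maps commit -> list of parent commits, i.e. the
    relationship a merge commit records. Returns every commit
    reachable from both heads; a real tool would then pick the
    best (most recent) of these as the merge base."""
    def reachable(head):
        seen, stack = set(), [head]
        while stack:
            c = stack.pop()
            if c not in seen:
                seen.add(c)
                stack.extend(parents.get(c, ()))
        return seen
    return reachable(head_a) & reachable(head_b)
```

A CVS-style merge records nothing, which is why a subsequent merge in the other direction has no base to work from.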

Bram and Tom continued ranting, Bram calling Linus an ass, and Tom calling Linus nuts, and at one point Linus said:

You haven't looked at git, have you?

Git already merges better than _any_ open-source SCM out there. It just does it so effortlessly that you didn't even realize it does that.

Today I've done four (count them) fully automated merges on the kernel tree: serial, networking, usb and arm.

And they took a fraction of a second (plus the download of the new objects, which is the real cost).

This is something that SVN _still_ cannot do, for example.

Elsewhere, Petr and Linus discussed the future relationship of git and Cogito (or git-pasky). Petr said:

I assume that you don't want to merge my "SCM layer" (which is perfectly fine by me). However, I also apply plenty of patches concerning the "core git" - be it portability, leak fixes, argument parsing fixes and so on.

Would it be of any benefit if I maintained two trees, one with just your core git but what I merge (I think I'd call this branch git-pb), and one with my git-pasky (to be renamed to Cogito) layer. I'd then put the "core git" changes to the git-pb branch and pull from it to the Cogito branch regularly, but it should be safe for you to pull from it too.

In fact, in that case I might even end up entirely separating the Cogito tools from the core git and distributing them independently.

BTW, just out of interest, are you personally planning to use Cogito for your kernel and sparse (and possibly even git) work, or will you stay with your lowlevel plumbing for that?

Linus replied, "I'm actually perfectly happy to merge your SCM layer too eventually, but I'm nervous at this point. Especially while people are discussing some SCM options that I'm personally very leery of, and think that may make sense for others, but that I personally distrust." Regarding whether he would use Cogito, Linus said:

I'm really really hoping I'd use cogito, and that it ends up being just one project. In particular, I'm hoping that in a few days, I'll have done enough plumbing that I don't even care any more, and then I'd not even maintain a tree of my own.

I'm really not that much of an SCM guy. I detest pretty much all SCM's out there, and while it's been interesting to do 'git', I've done it because I was forced to, and because I really wanted to put _my_ needs and opinions first in an SCM, and see how that works. That's why I've been so adamant about having a "philosophy", because otherwise I'd probably just end up with yet another SCM that I'd despise.

So for me, the "optimal" situation really ends up that you guys end up as the maintainers. I don't even _want_ to maintain it, although I'd be more than happy to be part of the engineering team. I just want to mark out the direction well enough and get it to a point where I can _use_ it, that I feel like I'm done.

But before I can do that, I need to feel like I can live with the end result. The only missing part is merges, and I think you and Junio are getting pretty close (with Daniel's parent finder, Junio's merger etc).

Petr said, "it is great news to me. Actually, in that case, is it worth renaming it to Cogito and using cg to invoke it? Wouldn't be that actually more confusing after it gets merged? IOW, should I stick to "git" or feel free to rename it to "cg"?" Linus replied:

I'm perfectly happy for it to stay as "git", and in general I don't have any huge preferences either way. You guys can discuss names as much as you like, it's the "tracking renames" and "how to merge" things that worry me.

I think I've explained my name tracking worries. When it comes to "how to merge", there's three issues:

The third point is why I'm going to the ultra-conservative "three-way merge from the common parent". It's not fancy, but it's something I feel comfortable with as a merge strategy. For example, arch (and in particular darcs) seems to want to try to be "clever" about the merges, and I'd always live in fear.

And, finally, there's obviously performance. I _think_ a normal merge with nary a conflict and just a few tens of files changed should be possible in a second. I realize that sounds crazy to some people, but I think it's entirely doable. Half of that is writing the new tree out (that is a relative costly op due to the compression). The other half is the "work".

Elsewhere, Linus told Junio he had a "cunning plan": "It turns out that I can do merges even more simply, if I just allow the notion of "state" into an index entry, and allow multiple index entries with the same name as long as they differ in "state". And that means that I can do all the merging in the regular index tree, using very simple rules. Let's see how that works out. I'm writing the code now." Linus replied to himself:

Damn, my cunning plan is some good stuff.

Or maybe it is _so_ cunning that I just confuse even myself. But it looks like it is actually working, and that it allows pretty much instantaneous merges.

The plan goes like this:

Ok, this all sounds like a collection of totally nonsensical rules, but it's actually exactly what you want in order to do a fast merge. The different stages represent the "result tree" (stage 0, aka "merged"), the original tree (stage 1, aka "orig"), and the two trees you are trying to merge (stage 2 and 3 respectively).

In fact, the way "read-tree" works, it's entirely agnostic about how you assign the stages, and you could really assign them any which way, and the above is just a suggested way to do it (except since "write-tree" refuses to write anything but stage0 entries, it makes sense to always consider stage 0 to be the "full merge" state).

So what happens? Try it out. Select the original tree, and two trees to merge, and look how it works:

So now the merge algorithm ends up being really simple:

NOTE NOTE NOTE! I could make "read-tree" do some of these nontrivial merges, but I ended up deciding that only the "matches in all three states" thing collapses by default. Why? Because even though there are other trivial cases ("matches in both merge trees but not in the original one"), those cases might actually be interesting for the merge logic to know about, so I thought I'd leave all that information around. I expect it to be fairly rare anyway, so writing out a few extra index entries to disk so that others can decide to annotate the merge a bit more sounded like a fair deal.

I should make "ls-files" have a "-l" format, which shows the index and the mode for each file too. Right now it's very hard to see what the contents of the index is. But all my tests seem to say that not only does this work, it's pretty efficient too. And it's dead _simple_, thanks to having all the merge information in just one place, the same index we always use anyway.

Btw, it also means that you don't even have to have a separate subdirectory for this. All the information literally is in the index file, which is a temporary thing anyway. We don't need to worry about what is in the working directory, since we'll never show it, and we'll never need to use it.

Damn, I'm good.

(On the other hand, it is Friday evening at 11PM, and I'm sitting in front of the computer. I'm a sad case. I will now go take a beer, and relax. I think this is another of my "Really Good Ideas" (tm), and is worth the beer. This "feels" right).

Junio was completely blown away by this, and started coding madly. After a flurry of activity from both of them, Linus said:

Junio, I pushed this out, along with the two patches from you. It's still more anal than my original "tree-diff" algorithm, in that it refuses to touch anything where the name isn't the same in all three versions (original, new1 and new2), but now it does the "if two of them match, just select the result directly" trivial merges.

I really cannot see any sane case where user policy might dictate doing anything else, but if somebody can come up with an argument for a merge algorithm that wouldn't do what that trivial merge does, we can make a flag for "don't merge at all".

The reason I do want to merge at all in "read-tree" is that I want to avoid having to write out a huge index-file (it's 1.6MB on the kernel, so if you don't do _any_ trivial merges, it would be 4.8MB after reading three trees) and then having people read it and parse it just to do stuff that is obvious. Touching 5MB of data isn't cheap, even if you don't do a whole lot to it.

Anyway, with the modified read-tree, as far as I can tell it will now merge all the cases where one side has done something to a file, and the other side has left it alone (or where both sides have done the exact same modification). That should _really_ cut down the cases to just a few files for most of the kernel merges I can think of.

Does it do the right thing for your tests?

Junio confirmed it did.
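
The trivial-merge rules Linus describes can be sketched in Python (a hypothetical illustration of the stage logic, not git's actual C code; the blob ids like "a1" are made-up stand-ins for SHA-1 names):

```python
# Hypothetical sketch of the trivial-merge rules described above.
# Stage 1 holds the original tree's blob id for a path, stages 2 and 3
# the two trees being merged; a collapse to stage 0 means "merged".

def trivial_merge(orig, ours, theirs):
    """Return the stage-0 blob id, or None if a real merge is needed."""
    if ours == theirs:
        return ours        # both sides agree (incl. identical changes)
    if orig == ours:
        return theirs      # we left it alone; take their change
    if orig == theirs:
        return ours        # they left it alone; take our change
    return None            # both changed it: leave stages 1-3 in place

print(trivial_merge("a1", "a1", "b2"))  # b2   (one side changed)
print(trivial_merge("a1", "b2", "c3"))  # None (needs a content merge)
```

This covers exactly the cases Linus mentions: one side changed a file while the other left it alone, or both sides made the same change; everything else stays unmerged in the index for higher-level tools to resolve.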

2. Memory Leaks Not Critical In Git

14 Apr 2005 - 18 Apr 2005 (22 posts) Subject: "[patch] git: fix memory leak in checkout-cache.c"

Topics: update-cache

People: Linus Torvalds, Ingo Molnar

Ingo posted several patches to fix memory leaks in git. To one of these, Linus Torvalds replied:

if the common case is that we update a couple of entries in the active cache, we actually saved 1.6MB (+ malloc overhead for the 17 _thousand_ allocations) by my approach.

And the leak? There's none. We never actually update an existing entry that was allocated with malloc(), unless the user does something stupid. In other words, the only case where there is a "leak" is when the user does something like

update-cache file file file file file file ..

with the same file listed several times.

And dammit, the whole point of doing stuff in user space is that the kernel takes care of business. Unlike kernel work, leaking is ok. You just have to make sure that it is limited enough to not be a problem. I'm saying that in this case we're _better_ off leaking, because the mmap() trick saves us more memory than the leak can ever leak.

(The command line is limited to 128kB or so, which means that the most files you _can_ add with a single update-cache is _less_ than the mmap win).

It was _such_ a relief to program in user mode for a change. Not having to care about the small stuff is wonderful.
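
The mmap() trick Linus refers to can be illustrated with a hedged sketch (Python standing in for the C code, with an invented NUL-separated "index" format): the file is mapped once, and each entry is read as a slice of that single mapping rather than a separate allocation.

```python
# Illustration (not git's actual code) of reading entries out of one
# mmap'ed file instead of malloc()ing thousands of per-entry copies.
import mmap
import os
import tempfile

tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(b"entry-one\0entry-two\0entry-three\0")  # fake index data
tmp.close()

with open(tmp.name, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    offsets = []
    start = 0
    while True:
        end = mm.find(b"\0", start)
        if end == -1:
            break
        offsets.append((start, end))
        start = end + 1
    first = mm[offsets[0][0]:offsets[0][1]]  # slice of the one mapping
    print(first, len(offsets))
    mm.close()
os.unlink(tmp.name)
```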

Ingo replied, "fair enough - as long as this is only used in a scripted environment, and not via some library and not within a repository server, web backend, etc."

3. Patches For Cogito Coming In Fast And Furious

14 Apr 2005 (5 posts) Subject: "[PATCH] Git pasky include commit-id in Makefile"

Topics: commit-id

People: Darren Williams, Petr Baudis, Martin Schlemmer

Darren Williams said, "Currently the commit-id script is not installed with make install; this patch includes it in the SCRIPT target. This patch is against git-pasky-0.4." Martin Schlemmer noticed the fix had not gone in by later that day; and Petr Baudis replied, "Come on people, give me a while. I just started walking through the queued bugfixes. ;-)" And Martin said, "Sorry, thought you missed it as you had in later patches =)"

4. Consideration Of File Renames; First Mention Of 'Cogito' For Tool Name

14 Apr 2005 - 15 Apr 2005 (22 posts) Subject: "Handling renames."

Topics: SHA1, diff-tree

People: David Woodhouse, Linus Torvalds, Ingo Molnar, H. Peter Anvin, Zach Welch, Andrew Timberlake-Newell, Steven Cole, Petr Baudis

David Woodhouse suggested a method for handling file renames, saying:

One proposed solution was to have a separate revision history for individual files, with a new kind of 'filecommit' object which parallels the existing 'commit', referencing a blob instead of a tree. Then trees would reference such objects instead of referencing blobs directly.

I think that introduces a lot of redundancy though, because 99% of the time, the revision history of the individual file is entirely reproducible from the revision history of the tree. It's only when files are renamed that we fall over -- and I think we can handle renames fairly well if we just log them in the commit object.

My '' script is already capable of tracking a given file back through multiple tree commits, listing those commits where the file in question was actually changed. It uses my patched version of diff-tree which supports 'diff-tree <TREE_A> <TREE_B> <filename>' in order to do this.

By storing rename information in the commit object, the script (or a reimplementation of a similar algorithm) could know when to change the filename it's looking for, as it goes back through the tree. That ought to be perfectly sufficient.

So a commit involving a rename would look something like this...

        tree 82ba574c85e9a2e4652419c88244e9dd1bfa8baa
        parent bb95843a5a0f397270819462812735ee29796fb4
        rename foo.c bar.c
        author David Woodhouse <> 1113499881 +0100
        committer David Woodhouse <> 1113499881 +0100
        Rename foo.c to bar.c and s/foo_/bar_/g

Linus Torvalds did not like this idea. He said:

A "rename" really doesn't exist in the git model. The git model really is about tracking data, not about tracking what happened to _create_ that data.

The one exception is the commit log. That's where you put the explanations of _why_ the data changed. And git itself doesn't care what the format is, apart from the git header.

So, you really need to think of git as a filesystem. You can then implement an SCM _on_top_of_it_.

He went on, regarding David's proposed change to the git header:

I want that empty line in there, and I want it in the "free-form" section. The "rename" part really isn't part of the git header. It's not what git tracks, it was tracked by an SCM system on top of git.

So the git header is an "inode" in the git filesystem, and like an inode it has a ctime and an mtime, and pointers to the data. So as far as git is concerned, this part:

        tree 82ba574c85e9a2e4652419c88244e9dd1bfa8baa
        parent bb95843a5a0f397270819462812735ee29796fb4
        author David Woodhouse <> 1113499881 +0100
        committer David Woodhouse <> 1113499881 +0100

really is the filesystem "inode". The rest is whatever the filesystem user puts into it, and git won't care.

At this point he added:

There are too many messy SCM's out there that do not have a "philosophy". Dammit, I'm not interested in creating another one. This thing has a mental model, and we keep to that model.

The reason UNIX is beautiful is that it has a mental model of processes and files. Git has a mental model of objects and certain very very limited relationships. The relationships git cares about are encoded in the C files, the "extra crap" (like rename info) is just that - stuff that random scripts wrote, and that is just informational and not central to the model.

Close by, Ingo Molnar suggested having a rename_commit special case for commits involving file renames. He said, "If a rename happens then the old tree references the rename_commit object (instead of the blob), and the new tree references it too. This way there's no need to list the rename via namespace means: if a tree entry points to a rename_commit object then a rename happened and the rename_commit object is looked up in the old tree to get the old name." But Linus didn't like this idea either. He said:

ANYTHING you do with games like this will cause the "same directory creates different object" crap.

Git doesn't do that. There fundamentally is no history in objects, _except_ for the commit object. Two objects with the same name are identical, and that means that they are easy to share.

Any time you break that model, you break the whole point of git. Don't do it. You'll be very very sorry if you ever do, because it breaks the clean separation of "time" and "space". I guarantee you that your merges will become _harder_ rather than easier.

What you can do at an SCM level, is that if you want to track renames, you track them as a separate commit altogether. Ie if you notice a rename, you first commit the rename (and you can _see_ it's a rename, since the object didn't change, and the sha1 stayed the same, which in git-speak means that it is the same object, ie that _is_ a rename as far as git is concerned), and then you create the "this is the data that changed" as a _second_ commit.

But don't make it a new kind of commit. It's just a regular commit, dammit. No new abstractions.

Trust me, it's worth it to follow the rules. You don't start making up new concepts for every new thing you track. Next you'll want "tag objects". That's a totally idiotic idea. What you do is you tag things at a higher level than git ever is, and git will _never_ have to know about tag objects.

Some "higher level" thing can add its own rules _on_top_ of git rules. The same way we have normal applications having their _own_ rules on top of the kernel. You do abstraction in layers, but for this to work, the base you build on top of had better be damn solid, and not have any ugly special cases.

Elsewhere, H. Peter Anvin said he thought file renames were very important. He said, "Although Linus is correct in that an SCM doesn't *have* to handle this, it really feels like shooting for mediocrity to me. We might as well design it right from the beginning." Linus replied:

No. git is not an SCM. it's a filesystem designed to _host_ an SCM, and that _is_ doing it right from the beginning.

Keep the abstractions clean. Do _not_ get confused into thinking that git is an SCM. If you think of it that way, you'll end up with crap you can't think about.

And at a filesystem layer, "rename" already exists. It's moving an object to a new name in a tree. git already does that very well, thank you very much.

But a filesystem rename is _not_ the same thing as an SCM rename. An SCM rename is built on top of a filesystem rename, but it has its own issues that may or may not make sense for the filesystem.

H. Peter pointed out that he hadn't been talking about git per se, but just the SCM layered on top of it. Zach Welch replied, "I imagine quite a few folks expect something not entirely unlike an SCM to emerge from these current efforts. Moreover, Petr's 'git' scripts wrap your "filesystem" plumbing to that very end. To avoid confusion, I think it would be better to distinguish the two layers, perhaps by calling the low-level plumbing... 'gitfs', of course." Andrew Timberlake-Newell suggested, "Or perhaps to come up with a name (or at least nickname) for the SCM. GitMaster?" Steven Cole replied:

Cogito. "Git inside" can be the first slogan.

Differentiating the SCM built on top of git from git itself is probably worthwhile to avoid confusion. Other SCMs may be developed later, built on git, and these can come up with their own clever names.

Petr Baudis suggested 'tig', and H. Peter said he liked 'Cogito': "it's a real name, plus it'd be a good use for the otherwise-pretty-useless two-letter combination "cg"." Petr replied, "Duh, believe me or not but I completely missed the "Cogito" part of Steven's mail. Of course, I like it too."

5. git Mailing List Archives Available

14 Apr 2005 (1 post) Subject: "[ANNOUNCE] Archives of git at MARC"

People: Hank Leininger

Hank Leininger said:

Courtesy of Randy Dunlap, MARC now has full(?) archives for the git list, actually stretching a bit back to the pre-git-list discussions on linux-kernel, available at:

6. Two More Mailing List Archives Available

14 Apr 2005 - 15 Apr 2005 Archive Link: "Git archive now available"

People: Darren Williams

Darren Williams said:

Thanks to the team at Gelato@UNSW we now have a not so complete Git archive at

If somebody could send me a complete Git mbox I will update the archive with it.

Kenneth Johansson pointed out that there was also another list archive at

7. Introduction Of 'git fork' To Cogito; Big UI Changes In The Works

15 Apr 2005 - 20 Apr 2005 (29 posts) Subject: "[PATCH] Add "clone" support to lntree"

Topics: CVS, commit-id, git fork, git merge, git pull

People: Linus Torvalds, Petr Baudis, David A. Wheeler

Daniel Barkalow and Petr Baudis created the ability to do branching in Cogito. Petr's 'fork' command created a local branch of a repository, starting from a certain point in that repository's history. The branch occupied its own directory structure, completely independent from the original repository; and could be developed on its own, and then merged back to the main repository. But Linus Torvalds said:

I'm wondering why you talk about "branches" at all.

No such thing should exist. There are no branches. There are just repositories. You can track somebody else's repository, but you should track it by location, not by any "branch name".

And you track it by just merging it.

Yeah, we don't have really usable merges yet, but..

Petr replied:

First, this "level" of branches concerns multiple working directories tied to a single repository. It seems like a sensible thing to do; and you agreed with it too (IIRC). And when you do that, git-pasky just saves some work for you. For git-pasky, branch is really just a symbolic name for a commit ID, which gets updated every time you commit in some repository. Nothing more.

So the whole point of this is to have a symbolic name for some other working directory. When you want to merge, you don't need to go over to the other directory, do commit-id, cut'n'paste, and feed that to git merge. You just do

git merge myotherbranch

Now, about remote repositories. When you pull a remote repository, that does not mean it has to be immediately merged somewhere. It is very useful to have another branch you do *not* want to merge, but you want to do diffs to it, or even check it out / export it later to some separate directory. Again, the "branch" is just a symbolic name for the head commit ID of what you pulled, and the pointer gets updated every time you pull again - that's the whole point of it.

The last concept is "tracking" working directories. If you pull the tracked branch to this directory, it also automerges it. This is useful when you have a single canonical branch for this directory, which it should always mirror. That would be the case e.g. for the gazillions of Linux users who would like to just have the latest bleeding kernel of yours, and they expect to use git just like a "different CVS". Basically, they will just do

git pull

instead of

cvs update


Elsewhere, Petr said:

I've removed git branch, removed the possibility for git update to switch branches and renamed git update to git seek. You can do

git seek git-pasky-0.1

and examine stuff, but your tree is also blocked at the same time - git won't let you commit, merge and such. By doing

git seek
git seek master

you return back to your branch (assuming its name is master).

I think git fork is after all good enough for branching and it is the clean way. Shall there be a big demand for it, it should be minimal hassle to implement 'git switch', which would do that.

The discussion changed course here, with David A. Wheeler suggesting, "changing "pull" to ONLY download, and "update" to pull AND merge. Whenever you want to update, just say "git update", end of story." There were a few brief comments, and Petr said, "I start to like the pull/update distinction, and I think I'll go for it." He also added later, "These naming issues may appear silly but I think they matter big time for usability, intuitiveness, and learning curve (I don't want git-pasky to become another GNU arch)."

The discussion continued, with an examination of 'git cancel', the current way to undo any local changes; and whether to rename that command to 'git revert', 'git checkout', or something else.

8. Full 3G Kernel History Successfully Imported Into git

16 Apr 2005 - 19 Apr 2005 (42 posts) Subject: "full kernel history, in patchset format"

Topics: BitKeeper, CVS, SHA1

People: Ingo Molnar, Linus Torvalds, Petr Baudis, Jan-Benedict Glaw, Thomas Gleixner, David Lang

Ingo Molnar said:

i've converted the Linux kernel CVS tree into 'flat patchset' format, which gave a series of 28237 separate patches. (Each patch represents a changeset, in the order they were applied. I've used the cvsps utility.)

the history data starts at 2.4.0 and ends at 2.6.12-rc2. I've included a script that will apply all the patches in order and will create a pristine 2.6.12-rc2 tree.

it needed many hours to finish, on a very fast server with tons of RAM, and it also needed a fair amount of manual work to extract it and to make it usable, so i guessed others might want to use the end result as well, to try and generate large GIT repositories from them (or to run analysis over the patches, etc.).

the patches contain all the existing metadata, dates, log messages and revision history. (What i think is missing is the BK tree merge information, but i'm not sure we want/need to convert them to GIT.)

it's a 136 MB tarball, which can be downloaded from:

the ./generate-2.6.12-rc2 script generates the 2.6.12-rc2 tree into linux/, from scratch. (No pre-existing kernel is needed, as 2.patch generates the full 2.4.0 kernel tree.) The patching takes a couple of minutes to finish, on a fast box.

below i've attached a sample patch from the series.

note: i kept the patches the cvsps utility generated as-is, to have a verifiable base to work on. There were a very small amount of deltas missed (about a dozen), probably resulting from CVS related errors, these are included in the diff-CVS-to-real patch. Also, the patch format cannot create the Documentation/logo.gif file, so the script does this too - just to be able to generate a complete 2.6.12-rc2 tree that is byte-for-byte identical to the real thing.

Linus Torvalds replied:

Hey, that's great. I got the CVS repo too, and I was looking at it, but the more I looked at it, the more I felt that the main reason I want to import it into git ends up being to validate that my size estimates are at all realistic.

I see that Thomas Gleixner seems to have done that already, and come to a figure of 3.2GB for the last three years, which I'm very happy with, mainly because it seems to match my estimates to a tee. Which means that I just feel that much more confident about git actually being able to handle the kernel long-term, and not just as a stop-gap measure.

But I wonder if we actually want to actually populate the whole history.. Now that my size estimates have been verified, I have little actual real reason to put the history into git. There are no visualization tools done for git yet, and no helpers to actually find problems, and by the time there will be, we'll have new history.

So I'd _almost_ suggest just starting from a clean slate after all. Keeping the old history around, of course, but not necessarily putting it into git now. It would just force everybody who is getting used to git in the first place to work with a 3GB archive from day one, rather than getting into it a bit more gradually.

What do people think? I'm not so much worried about the data itself: the git architecture is _so_ damn simple that now that the size estimate has been confirmed, that I don't think it would be a problem per se to put 3.2GB into the archive. But it will bog down "rsync" horribly, so it will actually hurt synchronization until somebody writes the rev-tree-like stuff to communicate changes more efficiently..

IOW, it smells to me like we don't have the infrastructure to really work with 3GB archives, and that if we start from scratch (2.6.12-rc2), we can build up the infrastructure in parallel with starting to really need it.

But it's _great_ to have the history in this format, especially since looking at CVS just reminded me how much I hated it.

Petr Baudis said it would be fine with him to start fresh, though "Perhaps we should have a separate GIT repository with the previous history though, and in the first new commit the parent could point to the last commit from the other repository." Jan-Benedict Glaw also replied to Linus, saying:

3GB is quite some data, but I'd accept and prefer to download it from somewhere. I think that it's worth it.

I accept that there are people out there which would love to get a smaller archive, but at least most developers that would actually use it for day-to-day work *do* have the bandwidth to download it. Maybe we'd also prepare (from time to time) bzip'ed tarballs, which I expect to be a tad smaller.

Elsewhere, Thomas Gleixner thought it would be OK to "export the 2.6.12-rc2 version of the git'ed history tree and start from there. Then the first changeset has a parent, which just lives in a different place. That's the only difference to your repository, but it would change the sha1 sums of all your changesets."

Elsewhere, David Lang said, "at least start with a full release. say 2.6.11 the history won't be blank, but it's far more likely that people will care about the details between 2.6.11 and 2.6.12 and will want to go back before -rc2"

Ingo also replied to Linus, saying:

it definitely feels a bit brave to import 28,000 changesets into a source-code database project that will be a whopping 2 weeks old in 2 days ;) Even if we felt 100% confident about all the basics (which we do of course ;), it's just simply too young to tie things down via a 3.2GB database. It feels much more natural to grow it gradually, 28,000 changesets i'm afraid would just suffocate the 'project growth dynamics'. Not going too fast is just as important as not going too slow.

I didn't generate the patchset to get it added into some central repository right now, i generated it to check that we _do_ have all the revision history in an easy to understand format which does generate today's kernel tree, so that we can lean back and worry about the full database once things get a bit more settled down (in a couple of months or so). It's also an easy testbed for GIT itself.

but the revision history was one of the main reasons i used BK myself, so we'll need a merged database eventually. Occasionally i needed to check who was the one who touched a particular piece of code - was that fantastic new line of code written by me, or was that buggy piece of crap written by someone else? ;) Also, looking at a change and then going to the changeset that did it, and then looking at the full picture was pretty useful too. So that sort of annotation, and generally navigating around _quickly_ and looking at the 'flow' of changes going into a particular file was really useful (for me).

9. Darcs Attempts git Interface; Some git Goals; Prospect Of libgit

16 Apr 2005 - 18 Apr 2005 (13 posts) Subject: "using git directory cache code in darcs?"

Topics: Darcs, SHA1

People: David Roundy, Linus Torvalds, Paul Jackson

David Roundy (the darcs maintainer) said:

I've been thinking about the possibility of using the git "current directory cache" code in darcs. Darcs already has an abstraction layer over its pristine directory cache, so this shouldn't be too hard--provided the git code is understandable. The default in darcs is currently to use an actual directory ("_darcs/current") as the cache, and we synchronize the file modification times in the cache with those of identical files in the working directory to speed up compares. We (the darcs developers) have talked for some time about introducing a single-file directory cache, but no one ever got around to it, partly because there wasn't a particularly compelling need.

It seems that the git directory cache is precisely what we want. Also, if we switch to (optionally) using the git directory cache, I imagine it'll make interfacing with git a lot easier. And, of course, it would significantly speed up a number of darcs commands, which are limited by the slowness of the readdir-related code. We haven't tracked down why this is, but a recursive directory compare in which we readdir only one of the directories (since we don't care about new files in the other one) takes half the time of a compare in which we readdir both directories.
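
The stat-based shortcut David describes (syncing mtimes so unchanged files can be skipped) might look roughly like this in Python; a toy sketch, where `stat_sig` and the cache layout are invented for illustration:

```python
# Toy sketch: cache (size, mtime) per file and skip the expensive
# content compare while the stat signature still matches.
import os
import tempfile

cache = {}

def stat_sig(path):
    st = os.stat(path)
    return (st.st_size, st.st_mtime_ns)

def changed(path):
    sig = stat_sig(path)
    if cache.get(path) == sig:
        return False       # stat matches: assume content is unchanged
    cache[path] = sig      # would do a real compare here, then record
    return True

d = tempfile.mkdtemp()
f = os.path.join(d, "a.txt")
with open(f, "w") as fh:
    fh.write("hello\n")
print(changed(f))  # True  (first sight, nothing cached yet)
print(changed(f))  # False (stat signature unchanged)
```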

Linus Torvalds thought it would be great if David explored this idea, though he did say, "I really don't know how well the git index file will work with darcs, and the main issue is that the index file names the "stable copy" using the sha1 hash. If darcs uses something else (and I imagine it does) you'd need to do a fair amount of surgery, and I suspect merging changes won't be very easy." He suggested waiting until git had stabilized a bit more, before attempting something like that. But he did add, "My gut _feel_ is that the basic git low-level architecture is done, and you can certainly start looking around and see if it matches darcs at all."

David began his explorations, and Linus remarked:

one of my hopes was that other SCM's could just use the git plumbing. But then I'd really suggest that you use "git" itself, not any "libgit". Ie you take _all_ the plumbing as real programs, and instead of trying to link against individual routines, you'd _script_ it.

In other words, "git" would be an independent cache of the real SCM, and/or the "old history" (ie an SCM that uses git could decide that the git stuff is fine for archival, and really use git as the base: and then the SCM could entirely concentrate on _only_ the "interesting" parts, ie the actual merging etc).

That was really what I always personally saw "git" as, just the plumbing beneath the surface. For example, something like arch, which is based on "patches and tar-balls" (I think darcs is similar in that respect), could use git as a _hell_ of a better "history of tar-balls".

The thing is, unless you take the git object database approach, using _just_ the index part doesn't really mean all that much. Sure, you could just keep the "current objects" in the object database, but quite frankly, there would probably not be a whole lot of point to that. You'd waste so much time pruning and synchronizing with your "real" database that I suspect you'd be better off not using it.

Paul Jackson was very much in favor of a 'libgit', saying, "Trying to make the executable 'git' commands the bottom layer of the user implementation stack forces inefficiencies on higher layers of the stack, and thus encourages stupid workarounds and cheats in an effort to speed things up." He urged Linus to encourage a 'libgit' implementation, but Linus replied:

Not until all the data structures are really really stable.

That's the thing - we can keep the _program_ interfaces somewhat stable. But internally we may change stuff wildly, and anybody who depends on a library interface would be screwed.

Ergo: no library interfaces yet. Wait for it to stabilize. Start trying to just script the programs.

10. FastCST Version Control System Similar To git

16 Apr 2005 (1 post) Subject: "Introductions"

Topics: FastCST

People: Zed A. Shaw

Zed A. Shaw said:

Just a short message to introduce myself and give a shameless plug. I'm Zed A. Shaw and I'm the author of a little unknown SCM called FastCST. While I doubt that Linus would ever adopt fastcst as his tool (and I probably wouldn't want him to, since it's not quite ready for prime time) I did find many of the discussions on the list so far very interesting.

Someone sent me Linus' message about wanting to do a diff on the whole source tree, and just thought I'd mention that I already tried this in FastCST. FastCST uses a suffix array to construct a delta (not a diff), so I thought it might be possible to simply apply the delta algorithm to the entire source tree and get very small changesets.

It worked on small source trees, but when it came to the Linux 2.6 tree it choked hard. Even with an efficient suffix array implementation, you're talking about performing a diff/delta on 225M of source. Added to the problem is that you have to track file locations within the massive blob. In the end, it also wasn't much more efficient from a size/space/time perspective so I dropped it.

My current solution to Linus' problem is to use an inverted index to process all the sources and revisions on the fly as they are created. Using the inverted index, I'm able to VERY quickly find any chunk of source in files or revisions. This lets me track things like how functions move through the files, where chunks of code moved to, etc. In the end this turns out to be much more efficient (7 seconds on my computer to find all references to "sprintf" in the Linux 2.6 source) as I can use the super small deltas for distributing changes, and give developers a means of tracking content changes across "the world" in a simple search format.
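
A toy version of the inverted-index idea (hypothetical; FastCST's real implementation is surely more sophisticated) maps each identifier to the places it occurs, so a query like "all references to sprintf" becomes a single dictionary lookup:

```python
# Toy inverted index: token -> list of (file, line) positions.
# The sample files are made up for the example.
import re
from collections import defaultdict

def build_index(files):
    index = defaultdict(list)
    for name, text in files.items():
        for lineno, line in enumerate(text.splitlines(), 1):
            for token in set(re.findall(r"[A-Za-z_]\w*", line)):
                index[token].append((name, lineno))
    return index

files = {
    "main.c": 'int main(void) { sprintf(buf, "hi"); return 0; }',
    "log.c": "void log_msg(char *s) { sprintf(line, s); }",
}
idx = build_index(files)
print(sorted(idx["sprintf"]))  # [('log.c', 1), ('main.c', 1)]
```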

Anyway, just thought I'd throw in my experiences attempting what Linus is talking about. I actually agree with him that rename tracking isn't that great, but I've come to the conclusion that tracking renames is actually a specific case of just a general search problem. Different strokes for different folks I guess.

Other than that, I'm mostly interested in reading the messages and probably won't write anything unless people ask me directly for something. Thanks!

11. Overhauling Cogito's Script

16 Apr 2005 (1 post) Subject: "git-pasky: overhaul"


People: Rene Scharfe

Rene Scharfe said to Petr Baudis:

I just couldn't stand all the calls to grep and other external tools in and started rewriting it in a knee-jerk reaction.

You said in a private conversation that you don't like to include things like ${var#stuff} to stay "sh compatible", while OTOH you favour $(cmd) over `cmd`. Both are POSIX extensions of the classical Bourne Shell syntax (see e.g. for a feature comparison between POSIX shell, Bourne Shell and Korn Shells on HP-UX). For reference, The Open Group publishes its IEEE Std 1003.1 standard (vulgo: POSIX) on this website: So which shell do you want to target with your git scripts?

This time I tested the script. :] It copes with invalid IDs, non-existing valid IDs, abbreviated IDs, an omitted ID, valid IDs, with tags and branch names. I also made sure the script runs with bash, ash, pdksh, zsh and bash --posix (all on SuSE 9.2).

I changed the way an ID is verified. The script now tries to find tags and branches first by looking for .git/tags/<id> and .git/HEAD.<id> and after that looking inside .git/objects for a match. That's faster and now I can safely give a branch a name consisting of 40 hex digits. :-)
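
Rene's lookup order could be sketched like this (hypothetical Python; the `.git/tags/<id>` and `.git/HEAD.<id>` paths are taken from his description of git-pasky, and the demo layout is made up):

```python
# Sketch of the lookup order described: tags first, then branch heads,
# then the object database -- so a branch named like a 40-digit hex
# string still resolves to the branch, not to an object.
import os
import tempfile

def resolve(gitdir, name):
    for candidate in (os.path.join(gitdir, "tags", name),
                      os.path.join(gitdir, "HEAD.%s" % name)):
        if os.path.isfile(candidate):
            with open(candidate) as fh:
                return fh.read().strip()
    # Fall back: treat the name as an (abbreviated) object id.
    objdir = os.path.join(gitdir, "objects", name[:2])
    if os.path.isdir(objdir):
        for entry in os.listdir(objdir):
            if entry.startswith(name[2:]):
                return name[:2] + entry
    return None

gd = tempfile.mkdtemp()              # a made-up .git layout for demo
os.makedirs(os.path.join(gd, "tags"))
with open(os.path.join(gd, "tags", "v0.4"), "w") as fh:
    fh.write("deadbeef" * 5 + "\n")  # fake 40-hex commit id
print(resolve(gd, "v0.4")[:8])  # deadbeef
```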

The script follows in plain text format, not as a patch. Your and my version share only very few lines, so this way it's easier to review. I'll send you a patch if and when you're ready to apply it, ok?

12. Storing File Permissions In git

16 Apr 2005 - 17 Apr 2005 (16 posts) Subject: "Storing permissions"

Topics: SHA1, read-tree, update-cache

People: Martin Mares, Paul Jackson, Linus Torvalds, David A. Wheeler

Martin Mares said:

I frequently run into problems with file permissions -- some archives (including the master git archive) contain group-writable files, but when I check them out, the permissions get trimmed by my umask (quite sensibly) and update-cache complains that they need update.

Does it really make sense to store full permissions in the trees? I think that remembering the x-bit should be good enough for almost all purposes and the other permissions should be left to the local environment.

A couple of posts down the line, Paul Jackson clarified why a single bit would be sufficient. He said, "If any of the execute permissions of the incoming file are set, then the bit is stored ON, else it is stored OFF. On 'checkout', if the bit is ON, then the file permission is set mode 0777 (modulo umask), else it is set mode 0666 (modulo umask)." Linus Torvalds replied:

I think I agree.

Anybody willing to send me a patch? One issue is that if done the obvious way it's an incompatible change, and old tree objects won't be valid any more. It might be ok to just change the "compare cache" check to only care about a few bits, though: S_IXUSR and S_IFDIR. And then always write new "tree" objects out with mode set to one of

Then, at compare time, we only look at S_IXUSR matching for files (we never compare directory modes anyway). And at file create time, we create them with 0666 and 0777 respectively, and let the user's umask sort it out (and if the user has 0100 set in his umask, he can damn well blame himself).

This would pretty much match the existing kernel tree, for example. We'd end up with some new trees there (and in git), but not a lot of incompatibility. And old trees would still work fine, they'd just get written out differently.

He replied to his own patch request with:

Actually, I just did it. Seems to work for the only test-case I tried, namely I just committed it, and checked that the permissions all ended up being recorded as 0644 in the tree (if it has the -x bit set, they get recorded as 0755).

When checking out, we always check out with 0666 or 0777, and just let umask do its thing. We only test bit 0100 when checking for differences.

Maybe I missed some case, but this does indeed seem saner than the "try to restore all bits" case. If somebody sees any problems, please holler.

(Btw, you may or may not need to blow away your "index" file by just re-creating it with a "read-tree" after you've updated to this. I _tried_ to make sure that the compare just ignored the ce_mode bits, but the fact is, your index file may be "corrupt" in the sense that it has permission sets that sparse expects to never generate in an index file any more..)
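The scheme Paul and Linus converge on — record only whether a file is executable, and let the umask shape checkout permissions — can be sketched as follows. The function names are illustrative, not git's actual code:

```python
import stat

def stored_mode(disk_mode):
    """Record only 'executable or not': any execute bit set -> 0755,
    otherwise 0644 (the two values Linus says get written out)."""
    executable = disk_mode & (stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return 0o755 if executable else 0o644

def checkout_mode(recorded_mode, umask):
    """On checkout, expand to 0777 or 0666 and let the umask trim it."""
    base = 0o777 if recorded_mode & stat.S_IXUSR else 0o666
    return base & ~umask

print(oct(stored_mode(0o775)))           # 0o755
print(oct(stored_mode(0o664)))           # 0o644
print(oct(checkout_mode(0o755, 0o022)))  # 0o755
print(oct(checkout_mode(0o644, 0o002)))  # 0o664
```

With modes normalized this way, difference checks only need to test bit 0100 (S_IXUSR), matching "we only test bit 0100 when checking for differences."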

David A. Wheeler pointed out, "There's a minor reason to write out ALL the perm bit data, but only care about a few bits coming back in: Some people use SCM systems as a generalized backup system, so you can back up your system to an arbitrary known state in the past (e.g., "Change my /etc files to the state I was at just before I installed that &*#@ program!")." Linus replied:

Yes. I was actually thinking about having system config files in a git repository when I started it, since I noticed how nicely it would do exactly that.

However, since the mode bits also end up being part of the name of the tree object (ie they are most certainly part of the hash), it's really basically impossible to only care about one bit but writing out many bits: it's the same issue of having multiple "identical" blocks with different names.

It's ok if it happens occasionally (it _will_ happen at the point of a tree conversion to the new format, for example), but it's not ok if it happens all the time - which it would, since some people have umask 002 (and individual groups) and others have umask 022 (and shared groups), and I can imagine that some anal people have umask 0077 ("I don't want to play with others").

The trees would constantly bounce between a million different combinations (since _some_ files would be checked out with the "other" mode).

At least if you always honor umask or always totally ignore umask, you get a nice repeatable thing. We tried the "always ignore" umask thing, and the problem with that is that while _git_ ended up always doing a "fchmod()" to reset the whole permission mask, anybody who created files any other way and then checked them in would end up using umask.

One solution is to tell git with a command line flag and/or config file entry that "for this repo, I want you to honor all bits". That should be easy enough to add at some point, and then you really get what you want.

That said, git won't be really good at doing system backup. I actually _do_ save a full 32-bit of "mode" (hey, you could have "immutable" bits etc set), but anybody who does anything fancy at all with mtime would be screwed, for example.

Also, right now we don't actually save any other type of file than regular/directory, so you'd have to come up with a good save-format for symlinks (easy, I guess - just make a "link" blob) and device nodes (that one probably should be saved in the "cache_entry" itself, possibly encoded where the sha1 hash normally is).

Also, I made a design decision that git only cares about non-dotfiles. Git literally never sees or looks at _anything_ that starts with a ".". I think that's absolutely the right thing to do for an SCM (if you hide your files, I really don't think you should expect the SCM to see it), but it's obviously not the right thing for a backup thing.

(It _might_ be the right thing for a system config file, though, eg tracking something like "/etc/index.html" with git might be ok, modulo the other issues).

David liked the idea of being able to tell git to honor all permissions bits for a given repository. He added, "My real concern is I'm looking at the early design of the storage format so that it's POSSIBLE to extend git in obvious ways. As long as it's possible later, then that's a great thing."

13. Doing Real Kernel Work; Status Of merge

16 Apr 2005 - 18 Apr 2005 (60 posts) Subject: "Re-done kernel archive - real one?"

Topics: BitKeeper, Subversion, commit-tree, git status, read-tree, write-tree

People: Linus Torvalds, Russell King, David Woodhouse, H. Peter Anvin, Petr Baudis

Linus Torvalds said:

Ok, nobody really objected to the notion of leaving the kernel history behind for now, and in fact most people seemed to basically agree. So with that decided, the old kernel testing tree was actually perfectly ok, except it had been built up with the old-style commit date handling, which made me not want to use it as a base for any real work.

So I re-created the dang thing (hey, it takes just a few minutes), and pushed it out, and there's now an archive on in my public "personal" directory called "linux-2.6.git". I'll continue the tradition of naming git-archive directories as "*.git", since that really ends up being the ".git" directory for the checked-out thing.

I'm not going to announce it on linux-kernel yet, because I don't think it's useful to anybody but a git person anyway. Besides, I don't actually know how happy the people are about this distribution method and whether it ends up being a horrible disaster for the mirroring setup.

Peter made some noises about /pub/scm, which makes sense, and would be a better place than my public tree. Apparently there are other places that are willing and able to host things too, so we'll see.

NOTE! The roughly 10x expansion of archive size going from BK to git ends up in a similar 10x bandwidth expansion, in addition to just the overhead of reading tons of directory entries and comparing them (which is what both a wget and rsync thing ends up doing). I'm sure we can bring that down with smarter synchronization tools, but I also suspect that's some way away.

So is real common usage, though, so maybe it's not that bad at all. Who knows. We haven't hit a single real snag so far (except it took several days longer than I expected, but hey, I expect lots of things ;), and I'm sure real usage will show lots of them.

Similarly, we don't really have real merging, which makes tracking harder, but I suspect actually having a tree out there will make people more motivated and have more of a test-case. I'm feeling good enough about the plumbing that I think I solved the "hard" part of it, and now it's just the boring 95% left - scripting around it.

I think that with the new merge model, the easiest thing to do is to just download all new objects, and then download the HEAD file under a new name.

Ie we have two phases to the merge: first get the objects, with something like
rsync --ignore-existing -acv $(repo)/ .git/

which will _not_ download the new HEAD file (since you already have one of your own), and then when you actually decide to merge you do

rsync -acv $(repo)/HEAD .git/MERGE_WITH

and now you can look at your old HEAD, and the MERGE_WITH thing, look up the parents, and then do

read-tree -m <parent-tree> <head-tree> <merge-with-tree>
commit-tree <result-tree> -p <head-tree> -p <merge-with-tree>

(which should actually _work_, assuming that the merge had no file conflicts).

This seems to be a sane way to do merges, and if the scripting starts from there and then becomes smarter...
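What the "read-tree -m" step does for the conflict-free case can be modeled as trivial per-path three-way selection. This sketch uses plain dictionaries in place of tree objects and is only an illustration of the rule, not git's code:

```python
def trivial_merge(parent, ours, theirs):
    """Per-path three-way selection over {path: content} trees.
    A path where both sides diverged from the parent is a real
    conflict, left here as an error for a content-level merge."""
    result = {}
    for path in set(parent) | set(ours) | set(theirs):
        p, o, t = parent.get(path), ours.get(path), theirs.get(path)
        if o == t:                # both sides agree (including both deleted)
            chosen = o
        elif o == p:              # only their side changed this path
            chosen = t
        elif t == p:              # only our side changed this path
            chosen = o
        else:
            raise ValueError("conflict: " + path)
        if chosen is not None:
            result[path] = chosen
    return result

base   = {"a": "1", "b": "1"}
ours   = {"a": "2", "b": "1"}              # we changed a
theirs = {"a": "1", "b": "2", "c": "9"}    # they changed b and added c
print(sorted(trivial_merge(base, ours, theirs).items()))
# [('a', '2'), ('b', '2'), ('c', '9')]
```

When every path falls into one of the first three cases, the merged tree can be written out and committed with two parents, exactly as in the read-tree/commit-tree recipe above.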

Russell King said, "We need to work out how we're going to manage to get our git changes to you. At the moment, I've very little idea how to do that. Ideas?" Linus replied:

To me, merging is my highest priority. I suspect that once I have a tree from you (or anybody else) that I actually _test_ merging with, I'll be motivated as hell to make sure that my plumbing actually works.

After all, it's not just you who wants to avoid the pain of merging: it's definitely in my own best interests to make merging as easy as possible. You're _the_ most obvious initial candidate, because your merges almost never have any conflicts at all, even on a file level (much less within a file).

Completely elsewhere, David Woodhouse said, regarding Linus's initial post, "Do you want the commits list running for it yet? Do you want the changesets which are already in it re-mailed without a 'TESTING' tag?" Linus replied:

I really don't know. I'm actually very happy where this thing is right now, and completing that first merge successfully was a big milestone to me personally. That said, actually _using_ this thing is not for the faint-of-heart, and while I think "git" already is showing itself to be useful, I'm very very biased.

In other words, I really wonder what an outsider that doesn't have the same kind of mental bias thinks of the current git tree. Is it useful, or is it still just a toy for Linus to test out his crazy SCM-wannabe.

Can people usefully track my current kernel git repository, or do you have to be crazy to do so? That's really the question. You be the judge. Me, I'm just giddy from a merge that was clearly done using interfaces that aren't actually really usable for anybody but me, and barely me at that ;)

Btw, I also do want this to show up in the BK trees for people who use BitKeeper - the same way we always supported tar-ball + patch users before. So I'll have to try to come up with some sane way to do that too. Any ideas? The first series of 198 patches is obvious enough and can be just done that way directly, but the merge..

H. Peter Anvin said, "I have set up /pub/scm/linux/kernel/git on Everyone who had directories in /pub/linux/kernel/people now has directories in /pub/scm/linux/kernel/git. For non-kernel trees it would probably be better to have different trees, however; I also would like to request that git itself is moved to /pub/software/scm/git; I have created that directory and made it owned by Linus."

Russell said to Linus, regarding the ease of tracking his tree, "I guess I'll have the pleasure to find that out when I update my tree with your latest changes... which I think is a project for tomorrow." He replied to himself, "I pulled it tonight into a pristine tree (which of course worked.) In doing so, I noticed that I'd messed up one of the commits - there's a missing new file. Grr. I'll put that down to being a newbie git." Linus was thrilled to see that the merge worked, and added, "Actually, you should put that down to horribly bad interface tools. With BK, we had these nice tools that pointed out that there were files that you might want to commit (ie "bk citool"), and made this very obvious. Tools absolutely matter. And it will take time for us to build up that kind of helper infrastructure. So being newbie might be part of it, but it's the smaller part, I say. Rough interfaces is a big issue." Petr Baudis replied, "I just committed some simple git status, which is equivalent to svn status or cvs update (except it does no update). So it shows all the files not tracked by git with a question mark in front of them. This will need some ignore rules, though (currently it just ignores *.o and the tags file). Now it turns out that it is rather unfortunate that git ignores hidden files, since this would be a perfect object for that - I think it is useful to have the ignore list tracked by git. I think I will just name it git-ignores to be found in the working directory for now."

Elsewhere, Russell tried doing a merge from Linus's tree, but the whole thing blew up in his face. Linus said this was to be expected, as Cogito had not incorporated the latest merging ideas.

14. First Successful Linux Kernel Merge Using git

17 Apr 2005 (50 posts) Subject: "[0/5] Patch set for various things"

Topics: checkout-cache, commit-tree, fsck-cache, git merge, merge-base, read-tree, update-cache, write-tree

People: Linus Torvalds

In the course of discussion, Linus Torvalds announced the first Linux kernel code merge using git. He said:

It may not be pretty, but it seems to have worked fine!

Here's my history log (with intermediate checking removed - I was being pretty anal ;):

rsync -avz --ignore-existing .git/
rsync -avz --ignore-existing .git/MERGE-HEAD
merge-base $(cat .git/HEAD) $(cat .git/MERGE-HEAD)
for i in e7905b2f22eb5d5308c9122b9c06c2d02473dd4f $(cat .git/HEAD) $(cat .git/MERGE-HEAD); do cat-file commit $i | head -1; done
read-tree -m cf9fd295d3048cd84c65d5e1a5a6b606bf4fddc6 9c78e08d12ae8189f3bd5e03accc39e3f08e45c9 a43c4447b2edc9fb01a6369f10c1165de4494c88
commit-tree 7792a93eddb3f9b8e3115daab8adb3030f258ce6 -p $(cat .git/HEAD) -p $(cat .git/MERGE-HEAD)
echo 5fa17ec1c56589476c7c6a2712b10c81b3d5f85a > .git/HEAD
fsck-cache --unreachable 5fa17ec1c56589476c7c6a2712b10c81b3d5f85a

which looks really messy, because I really wanted to do each step slowly by hand, so those magic revision numbers are just cut-and-pasted from the results that all the previous stages had printed out.

NOTE! As expected, this merge had absolutely zero file-level clashes, which is why I could just do the "read-tree -m" followed by a write-tree. But it's a real merge: I had some extra commits in my tree that were not in Russell's tree, and obviously vice versa.

Also note! The end result is not actually written back to the current working directory, so to see what the merge result actually is, there's another final phase:

read-tree 7792a93eddb3f9b8e3115daab8adb3030f258ce6
update-cache --refresh
checkout-cache -f -a

which just updates the current working directory to the results. I'm _not_ caring about old dirty state for now - the theory was to get this thing working first, and worry about making it nice to use later.

A second note: a real "merge" thing should notice that if the "merge-base" output ends up being one of the inputs (if one side is a strict subset of the other side), then the merge itself should never be done, and the script should just update directly to whichever HEAD is non-common.
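That fast-forward rule can be written down directly. This is a hypothetical helper for illustration, not an actual git command:

```python
def plan_update(merge_base, head, merge_head):
    """If the common ancestor *is* one of the two heads, one side is a
    strict subset of the other and no merge commit should be made."""
    if merge_base == merge_head:
        return ("already-up-to-date", head)
    if merge_base == head:
        return ("fast-forward", merge_head)
    return ("real-merge", None)

print(plan_update("A", "A", "B"))  # ('fast-forward', 'B')
print(plan_update("A", "B", "A"))  # ('already-up-to-date', 'B')
print(plan_update("A", "B", "C"))  # ('real-merge', None)
```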

But as far as I can tell, this really did work out correctly and 100% according to plan. As a result, if you update to my current tree, the top-of-tree commit should be:

cat-file commit $(cat .git/HEAD)

tree 7792a93eddb3f9b8e3115daab8adb3030f258ce6
parent 8173055926cdb8534fbaed517a792bd45aed8377
parent df4449813c900973841d0fa5a9e9bc7186956e1e
author Linus Torvalds <> 1113774444 -0700
committer Linus Torvalds <> 1113774444 -0700

Merge with - ARM changes

First ever true git merge. Let's see if it actually works.

Yehaa! It did take basically zero time, btw. Except for my bumbling about, and the first "rsync the objects from rmk's directory" part (which wasn't horrible, it just wasn't instantaneous like the other phases).

Btw, to see the output, you really want to have a "git log" that sorts by date. I had an old "" that did the old recursive thing, and while it shows the right thing, the ordering ended up making it be very non-obvious that rmk's changes had been added recently, since they ended up being at the very bottom.

15. monotone-viz Ported To git-viz

17 Apr 2005 - 21 Apr 2005 (17 posts) Subject: "git-viz tool for visualising commit trees"

Topics: Monotone, git-viz

People: Petr Baudis, Ingo Molnar, Olivier Andrieu

Petr Baudis said:

Olivier Andrieu was kind enough to port his monotone-viz tool to git ( - use the one from the monotone repository). The tool visualizes the history flow nicely; see

for some screenshots.

Ingo Molnar replied, "really nice stuff! Any plans to include it in git-pasky, via a 'git gui' option or so? Also, which particular version has this included - the freshest tarball on the monotone-viz download site doesn't seem to include it." Petr replied:

AFAIK you need Monotone and grab it from the monotone repository.

git gui sounds interesting, but perhaps in a longer horizon, and perhaps not as an integral part of git-pasky. I don't know OCaml and it's a rather large thing.

Point'n'drag merges, anyone? ;-))

Olivier Andrieu also said to Ingo:

here's a tarball :

and a binary, compiled on Fedora Core 3 :

Please, bear in mind that this is really a hack. Since monotone and git has very similar concepts, I merely replaced the code that was accessing monotone's database (sqlite) by some code using git tools. But the UI still has references to monotone all over the place, a couple of things won't work, etc.

After some compilation shenanigans, Ingo got git-viz working, and said:

I just checked how the kernel repository looks with it, and I'm impressed! The GUI is top-notch, and the whole graph output and navigation is very mature visually. Kudos!

a couple of suggestions that are in the 'taste' category:

I guess you know it, and I'm definitely not complaining about prototype code, but rendering is quite slow: drawing the 340 changesets in the current kernel repository takes 15 seconds on a 2 GHz P4. Drawing the full kernel history (63,000 changesets) would take more than 45 minutes on this box.

The current rate of kernel development is ~2000 changesets per month, so drawing the kernel history will get 3 seconds slower every day - it will exceed 1 minute in 20 days, so this will become a pressing issue quite soon, I suspect.
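Ingo's projection checks out with quick arithmetic, starting from his measured numbers (15 seconds for 340 changesets, ~2000 new changesets per month):

```python
# Ingo's measurement: 340 changesets render in 15 seconds.
secs_per_changeset = 15 / 340

# ~2000 new changesets per month => roughly 67 per day.
changesets_per_day = 2000 / 30
extra_secs_per_day = secs_per_changeset * changesets_per_day
print(round(extra_secs_per_day, 1))   # 2.9 -- his "3 seconds slower every day"

# Full history: 63,000 changesets.
full_history_minutes = 63000 * secs_per_changeset / 60
print(round(full_history_minutes))    # 46 -- his "more than 45 minutes"
```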

16. git Tutorial

17 Apr 2005 - 18 Apr 2005 (2 posts) Subject: "yet another gitting started"

People: Randy Dunlap

Randy Dunlap said:

Here's the beginnings of yet another git usage/howto/tutorial.

It can grow or die... I'll gladly take patches for it, or Pasky et al can merge more git plumbing and toilet usages into it, with or without me.

Alex Riesen found this extremely useful for bootstrapping his own git explorations.

17. Three-Way-Merge Proof-Of-Concept

18 Apr 2005 (5 posts) Subject: "[PATCH] fix bug in read-cache.c which loses files when merging a"

People: Linus Torvalds

In the course of discussion, Linus Torvalds said:

I've pushed out my first real content merge: since Daniel Barkalow's object model stuff didn't apply to my tree any more (I had added the commit type tracking to mine after Daniel did his conversion), I instead applied his series to the place they were done against, and used git to merge the result with my current top-of-tree.

I based it on the two example scripts I had sent out, but obviously never tested until this point (since both of them had some serious syntax errors, and thus clearly wouldn't work).

I also checked in the stupid scripts, in the expectation that somebody else can improve on them and make them useful. For example, firing up an editor when the merge fails is probably a damn good idea.

Anyway, it seems to prove the concept of a real three-way merge, and it all actually worked exactly the way I envisioned. Whether the end result works or not, that's a different issue ;)

18. SCSI Trees Working With git

18 Apr 2005 (13 posts) Subject: "SCSI trees, merges and git status"

Topics: BitKeeper

People: James Bottomley

James Bottomley said:

As of today, I have two SCSI git trees operational:




The latter has a non-trivial merge in it because of a conflict in scsi_device.h, so merges actually do work ...

The trees are exported from BK a changeset at a time (except the merge bits, which were done manually). I'll continue to accumulate patches in the BK trees for the time being since we don't have a nice web browser interface for the git trees (and also my commit scripts are all BK based).

Linus, the rc-fixes repo is ready for applying ... it's the same one I announced on linux-scsi and lkml a while ago just with the git date information updated to be correct (the misc one should wait until after 2.6.12 is final).

Linus tried a merge, but ran into some problems along the way, and was unable to verify that the merge had actually taken place correctly. After some debugging, the problem turned out not to be so serious; James cleaned out the mess, retraced his steps, fixed the problem, and on the second attempt the merge went through correctly.

19. git-pasky (Cogito) 0.5 Released

18 Apr 2005 (1 post) Subject: "[ANNOUNCE] git-pasky-0.5"

Topics: git fork, git init, git merge, git pull, git status

People: Petr Baudis

Petr Baudis announced git-pasky (not yet renamed to Cogito) version 0.5, saying:

here finally goes git-pasky-0.5, my set of scripts upon Linus Torvalds' git, which aims to provide a humanly usable interface, to a degree similar to an SCM tool. You can get it at

See the READMEs etc for some introduction.

This contains plenty of changes; it's difficult to sum them up. It has been reworked to better support the concept of branches; you can create local branches which share the GIT object repository by git fork. There is also git init, which will let you start a new GIT object repository (possibly seeding it from some rsync URL), git status, better git log, a much cleaner concept of tracking (and consequently a simpler yet better git pull). Of course it contains the latest updates from Linus' branch too.

There is also git merge, which does some merging, but note well that it is vastly inferior to what we _can_ do (and what I will do now). Expect 0.6 soon where git merge will actually make use of the merging facilities. I released 0.5 basically only because I have been postponing it so long that I really feel ashamed of myself. ;-)

20. Committer Identity Control For git Commits

18 Apr 2005 - 19 Apr 2005 (10 posts) Subject: "[PATCH] provide better committer information to commit-tree.c"

Topics: commit-tree

People: Greg KHLinus Torvalds

Greg KH said:

Here's a small patch to commit-tree.c that does two things:

This allows people to set sane values for the commit names and email addresses, preventing odd, private hostnames and domains from being exposed to the world.

Linus Torvalds replied, "Gaah, I really was hoping that people wouldn't feel like they have to lie about their committer information. I guess we don't have much choice, but I'm not happy about it." Greg replied:

Well Russell has stated that he has to for EU Privacy reasons. And I'd like to do it as I don't have a local hostname for my laptop and my employer probably doesn't really want my address showing up :)

But if you really don't like it, and you don't want anyone trying to hide anything, at least allow for a proper domainname. On my boxes, the domainname doesn't show up at all without that patch (just the getdomainname() part). I'll split it out if you want.

Chris Wedgwood also spoke up in favor of the patch, and Linus said, "I ended up applying Greg's patch, since it's clear that many people want to do this."

21. New gitweb Web Interface For Linux 2.6 Development

18 Apr 2005 - 19 Apr 2005 (7 posts) Subject: "GIT Web Interface"

Topics: diff-tree

People: Kay Sievers, Petr Baudis

Kay Sievers said:

I'm hacking on a simple web interface, cause I missed the bkweb too much. It can't do much more than browse through the source tree and show the log now, but that should change... :)

How can I get the files touched with a changeset and the corresponding diffs belonging to it?

Petr Baudis liked the tree, and answered, "diff-tree to get the list of files, you can do the corresponding diffs e.g. by doing git diff -r tree1:tree2. Preferably make a patch for it first to make it possible to diff individual files this way." This worked for Kay, and Greg KH was also impressed.

Note: By the time of this writing, the above link just redirects to the usage.

22. A 2003 Meditation On Ideas Similar To git

18 Apr 2005 - 19 Apr 2005 (4 posts) Subject: "SCM ideas from 2003"

People: Kevin Smith

Kevin Smith discovered a page from 2003 by Yann Dirson entitled, "Version control according to myself". In it, Yann considers some ideas similar to those in git. Kevin said:

Here are the parts that particularly caught my eye:

"what's so special about files ?" where the author suggests that existing SCM systems are so blinded by the tradition of file orientation that they can't see that there might be alternatives.

"As a goodie we can even note that moving a file inside the hierarchy has become exactly similar as moving a code statement." where the author recognizes that renames are merely a special case of code moves.

His implementation ideas are quite different from git, but I thought it was pretty cool to find that someone was thinking about these ideas a couple years ago.

23. New 'wit' Web Interface For git

19 Apr 2005 (1 post) Subject: "wit - a git web interface"

Topics: SHA1, wit

People: Christian Meder

Christian Meder said:

I uploaded a first draft of wit to

Right now it's a minimal web interface on top of git. Unpack it, make sure you've got at least Python 2.3, optionally install c2html, adjust and start from the root with

$ PYTHONPATH=. python git/web/

Point your browser to http://localhost:8090

I append my random notes about this thing:

24. git Optimization; Major Repository Structural Changes

19 Apr 2005 - 20 Apr 2005 (54 posts) Subject: "[PATCH] write-tree performance problems"

Topics: Compression, FS: ext3, SHA1, convert-cache, fsck-cache, read-tree, update-cache, write-tree

People: Linus Torvalds, Chris Mason, Jon Seymour

Chris Mason did some tests and determined that while the quilt patch scripts took 2 seconds to apply 100 patches, git took just over a minute. He posted a patch to bring git's time down to 15 seconds, but this involved taking hash values from a file instead of calculating them 'by hand'. Linus said:

performance is important to me, but even more than performance is that I can trust the end results, and that means that we calculate the hashes instead of just taking them from somewhere else..

What I _would_ like is the ability to re-use an old tree, though. What you really want to do is not pass in a set of directory names and just trust that they are correct, but just pass in a directory to compare with, and if the contents match, you don't need to write out a new one.

I'll try to whip up something that does what you want done, but doesn't need (or take) any untrusted information from the user in the form "trust me, it hasn't changed".

Chris Mason replied, "We already have a "trust me, it hasn't changed" via update-cache. If it gets called wrong the tree won't reflect reality. The patch doesn't change the write-tree default, but does enable you to give write-tree better information about the parts of the tree you want written back to git." Having said that, Chris admitted that he didn't find his patch satisfactory either. He'd written it mainly to demonstrate where the slowdown occurred. The problem was that "I didn't see how to compare against the old tree without reading each tree object from the old tree, and that should be slower than what write-tree does now." Linus replied:

Reading a tree is faster, simply because you uncompress instead of compress. So I can read a tree in 0.28 seconds, but it takes me 0.34 seconds to write one. That said, reading the trees has disk seek issues if it's not in the cache.

What I'd actually prefer to do is to just handle tree caching the same way we handle file caching - in the index.

Ie we could have the index file track "what subtree is this directory associated with", and have a "update-cache --refresh-dir" thing that updates it (and any entry update in that directory obviously removes the dir-cache entry).

Normally we'd not bother and it would never trigger, but for your scripted setup it would be useful: it would end up caching all the tree information in a very efficient manner. Totally transparently, apart from the one "--refresh-dir" at the beginning. That one would be slightly expensive (ie would do all the stuff that "write-tree" does, but it would be done just once).

(We could also just make "write-tree" do it _totally_ transparently, but then we're back to having write-tree both read _and_ write the index file, which is a situation that I've been trying to avoid. It's so much easier to verify the correctness of an operation if it is purely "one-way").

I'll think about it. I'd love to speed up write-tree, and keeping track of it in the index is a nice little trick, but it's not quite high enough up on my worries for me to act on it right now.

But if you want to try to see how nasty it would be to add tree index entries to the index file at "write-tree" time automatically, hey...

Chris decided to give this a shot, and Linus suggested:

Start by putting it in at "read-tree" time, and adding the code to invalidate all parent directory indexes when somebody changes a file in the index (ie "update-cache" for anything but a "--refresh").

That would be needed anyway, since those two are the ones that already change the index file.

Once you're sure that you can correctly invalidate the entries (so that you could never use a stale tree entry by mistake), the second stage would be to update it at "write-tree" time.

Chris replied:

This was much easier than I expected, and it seems to be working here. It does slow down the write-tree slightly because we have to write out the index file, but I can get around that with the index file on tmpfs change.

The original write-tree needs .54 seconds to run

write-tree with the index speedup gets that down to .024s (same as my first patch) when nothing has changed. When it has to rewrite the index file because something changed, it's .167s.

I'll finish off the patch once you ok the basics below. My current code works like this:

  1. read-tree will insert index entries for directories. There is no index entry for the root.
  2. update-cache removes index entries for all parents of the file you're updating. So, if you update-cache fs/ext3/inode.c, I remove the index of fs and fs/ext3
  3. If write-tree finds a directory in the index, it uses the sha1 in the cache entry and skips all files/dirs under that directory.
  4. If write-tree detects a subdir with no directory in the index, it calls write_tree the same way it used to. It then inserts a new cache object with the calculated sha1.
  5. right before exiting, write-tree updates the index if it made any changes.

The downside to this setup is that I've got to change other index users to deal with directory entries that are there sometimes and missing other times. The nice part is that I don't have to "invalidate" the directory entry, if it is present, it is valid.
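The invalidation step in Chris's scheme (item 2 above) can be sketched as a toy shell model. This is purely illustrative — the real index is a single binary file, not a directory of stamp files — but it shows the shape of the logic: walk up from the changed path and drop each parent directory's cached entry.

```shell
#!/bin/sh
# Toy model of the invalidation in Chris's scheme (illustrative only):
# cached per-directory tree hashes live as stamp files under .cache/,
# and updating a file drops the stamps for every parent directory.
mkdir -p .cache
invalidate_parents() {
    d=$(dirname "$1")
    while [ "$d" != "." ] && [ "$d" != "/" ]; do
        rm -f ".cache/$(printf '%s' "$d" | tr / _)"
        d=$(dirname "$d")
    done
}
# e.g. invalidate_parents fs/ext3/inode.c drops .cache/fs_ext3 and .cache/fs
```

Entries that survive the walk are still valid, which matches Chris's point that a directory entry, if present, never needs a separate "stale" flag.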

Linus started to critique the logic of this plan, but suddenly said:

Chris, before you do anything further, let me re-consider.

Assuming that the real cost of write-tree is the compression (and I think it is), I really suspect that this ends up being the death-knell to my "use the sha1 of the _compressed_ object" approach. I thought it was clever, and I was ready to ignore the other arguments against it, but if it turns out that we can speed up write-tree a lot by just doing the SHA1 on the uncompressed data, and noticing that we already have the tree before we need to compress it and write it out, then that may be a good enough reason for me to just admit that I was wrong about that decision.

So I'll see if I can turn the current fsck into a "convert into uncompressed format", and do a nice clean format conversion.

Most of git is very format-agnostic, so that shouldn't be that painful. Knock wood.
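The payoff Linus is describing can be sketched with ordinary tools (a hypothetical object store, not git's code): once the object's name is the SHA1 of the uncompressed data, a duplicate can be detected before any compression work is done.

```shell
#!/bin/sh
# Hypothetical object store: name = SHA1 of the *uncompressed* contents,
# so an object we already have is detected before paying for compression.
mkdir -p objects
store() {
    sha=$(printf '%s' "$1" | sha1sum | cut -d' ' -f1)
    if [ -f "objects/$sha" ]; then
        echo "already have $sha, skipping compression"
    else
        printf '%s' "$1" | gzip -9 > "objects/$sha"
        echo "stored $sha"
    fi
}
store "some tree data"    # first call compresses and stores
store "some tree data"    # second call is detected without compressing
```

Under the old scheme the SHA1 was computed over the compressed result, so the "do we already have it?" check could only happen after compressing — which is exactly the cost being avoided here.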

After working on this for a while, Linus said:

I converted my git archives (kernel and git itself) to do the SHA1 hash _before_ the compression phase.

So I'll just have to publically admit that everybody who complained about that particular design decision was right. Oh, well.

He went on:

I actually wrote a trivial converter myself, and while I have to say that this object database conversion is a bit painful, the nice thing is that I tried very hard to make it so that the "git" programs will work with both a pre-conversion and a post-conversion database.

The only program where that isn't true is "fsck-cache", since fsck-cache for obvious reasons is very very unhappy if the sha1 of a file doesn't match what it should be. But even there, a post-conversion fsck will eat old objects, it will just warn about a sha1 mismatch (and eventually it will refuse to touch them).

Anyway, what this means is that you should actually be able to get my already-converted git database even using an older version of git: fsck will complain mightily, so don't run it.

What I've done is to just switch the SHA1 calculation and the compression around, but I've left all other data structures in their original format, including the low-level object details like the fact that all objects are tagged with their type and length.

As a result, the _only_ thing that breaks is that a new object will not have a SHA1 that matches the expectations of an old git, but since _checking_ the SHA1 is only done by fsck, not normal operations, all normal ops should work fine.

So to convert your old git setup to a new git setup, do the following:

Doing this on the git repository is nearly instantaneous. Doing it on the kernel takes maybe a minute or so, depending on how fast your machine is.

Sorry about this, but it's a hell of a lot simpler to do it now than it will be after we have lots of users, and I've really tried to make the conversion be as simple and painless as possible.

And while it doesn't matter right now (since git still does exactly the same - I did the minimal changes necessary to get the new hashes, and that's it), this _will_ allow us to notice existing objects before we compress them, and we can now play with different compression levels without it being horribly painful.

Ingo Molnar confirmed that the procedure worked correctly on both repositories he tried. Jon Seymour asked Linus, "Am I correct to understand that with this change, all the objects in the database are still being compressed (so no net performance benefit), but by doing the SHA1 calculations before compression you are keeping open the possibility that at some point in the future you may use a different compression technique (including none at all) for some or all of the objects?" Linus replied:

Correct. There is zero performance benefit to this right now, and the only reason for doing it is because it will allow other things to happen.

Note that the other things include:

  1. change the compression format to make it cheaper
  2. _keep_ the same compression format, but notice that we already have an object by looking at the uncompressed one.

I'm actually leaning towards just #2 at this time. I like how things compress, and it sure is simple. The fact that we use the equivalent of "-9" may be expensive, but the thing is, we don't actually write new files that often, and it's "just" CPU time (no seeking on disk or anything like that), which tends to get cheaper over time.

So I suspect that once I optimize the tree writing to notice that "oh, I already have this tree object", and thus build it up but never compressing it, "write-tree" performance will go up _hugely_ even without removing the compression. Because most of the time, write-tree actually only needs to create a couple of small new tree objects.

Elsewhere, Linus posted an additional patch, that "brings down the time to write a kernel tree from 0.34s to 0.24s, so a third of the time was just compressing objects that we ended up already having." At this point, he said:

I'll consider the problem solved for now. Yeah, I realize that it still takes you half a minute to commit the 100 quilt patches, but I just can't bring myself to think it's a huge problem in the kind of usage patterns I think are realistic.

If somebody really wants to replace quilt with git, he'd need to spend some effort on it. If you just want to work together reasonably well, I think 3 patches per second is pretty much there.

The discussion continued, and at one point Linus remarked, "It would be nicer for the cache to make the index file "header" be a "footer", and write it out last - that way we'd be able to do the SHA1 as we write rather than doing a two-pass thing. That's for another time." But a half-hour later he said:

That other time was now.

The header is still a header, but the sha1 is now at the end of the file, which means that the header version has been incremented by 1 (to 2).

This is also sadly an incompatible change, so once you update and install the new tools, you'll need to do

        tree=$(cat-file commit $(cat .git/HEAD) | sed 's/tree //;q')
        read-tree $tree
        update-cache --refresh

to re-build your index file.

Sorry about that, but the end result should be quite fast (especially if your sha1 is fast). The best benchmark is probably to just do a "time update-cache Makefile" in the kernel (before and after), when the cache was already up-to-date and with no time spent on stating lots of files. That kind of "one file changed" timing is actually the common case (in this case Makefile won't have changed, but update-cache doesn't care).

(Of course, I could optimize it to notice that the update-cache didn't do anything and avoid the write altogether, but that's likely optimizing for the wrong case, since normally you'd call update-cache when you know something changed).

Yeah, it's somewhat silly doing optimizations at this point, but I want to make sure that the data structures are all ready for a real release, and as part of that I want to make sure there are no stupid low-hanging fruit that we'll curse later. Better get it done with now.
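The sed in the rebuild recipe above just peels the tree SHA1 off a commit object's first line. Here is a standalone illustration with a mocked-up commit (the sha1s are arbitrary examples):

```shell
#!/bin/sh
# A commit object starts with "tree <sha1>"; 's/tree //;q' strips the
# prefix on the first line and quits before printing anything else.
commit='tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904
parent 1da177e4c3f41524e886b7f1b8a0c1fc7321cac2
author Example Author <author@example.com>'
tree=$(printf '%s\n' "$commit" | sed 's/tree //;q')
echo "$tree"    # prints 4b825dc642cb6eb9a060e54bf8d69288fbee4904
```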

25. wit Version 0.0.2 Released

19 Apr 2005 (1 post) Subject: "wit 0.0.2 - a web interface for git available"

Topics: wit

People: Christian Meder

Christian Meder said:

I've uploaded a new wit to

Wit is a web interface for git. Right now it includes: views of blob, commit and tree objects, generating patches for the commits, downloading of gz or bzip2 tarballs of trees.

It's easy to setup and a simple stand alone server configuration is included.


26. wit 0.0.3 Released; gitweb Also Progresses

19 Apr 2005 - 22 Apr 2005 (13 posts) Subject: "wit 0.0.3 - a web interface for git available"

Topics: wit

People: Christian Meder, Christoph Hellwig, Kay Sievers

Christian Meder said:

I uploaded a new version of wit to

Wit is a web interface for git. Right now it includes: views of blob, commit and tree objects, generating patches for the commits, downloading of gz or bzip2 tarballs of trees.

It's easy to setup and a simple stand alone server configuration is included.


Greg KH suggested working together with gitweb, and Christoph Hellwig also praised gitweb. Christian wanted to keep development separate though.

Christoph said one thing he would like to see on gitweb was "a show all diffs link for a changeset" feature. Kay Sievers replied:

It's working now:

Many thanks to Christian Gierke for all the interface work, the nice layout and the git logo. Thanks for the colored diff to Ken Brush.

The script itself is available on the same box by ftp.

27. wit Demo Site Created

20 Apr 2005 (1 post) Subject: "wit - demo site"

Topics: SHA1, wit

People: Christian Meder

Christian Meder said:

Thanks to my friend Frank Sattelberger I got access to a site where I could set up a demo for wit:

Couple of notes wrt why I work on another git web interface compared with Kay's work:

28. Arch To Adopt git

20 Apr 2005 - 22 Apr 2005 (23 posts) Subject: "[ANNOUNCEMENT] /Arch/ embraces `git'"

People: Tom Lord

Tom Lord (the Arch maintainer) said, "`git', by Linus Torvalds, contains some very good ideas and some very entertaining source code -- recommended reading for hackers. /GNU Arch/ will adopt `git'" [...] "In my view, the core ideas in `git' are quite profound and deserve an impeccable implementation."

29. Cogito (git-pasky) 0.6.2 Released; Command Names To Change; 'git log' To Pipe To 'less'

20 Apr 2005 - 25 Apr 2005 (33 posts) Subject: "[ANNOUNCE] git-pasky-0.6.2 && heads-up on upcoming changes"

Topics: SHA1, cg-pull, diff-cache, git init, git merge, git pull

People: Petr Baudis, Linus Torvalds

Petr Baudis said:

I've "released" git-pasky-0.6.2 (my SCMish layer on top of Linus Torvalds' git tree history storage system), find it at the usual

git-pasky-0.6 has a couple of big changes; mainly an enhanced git diff, git patch (to be renamed to cg mkpatch), an enhanced git pull and a completely reworked git merge - it now uses the git-core facilities for merging, and does the merges in-tree. Plenty of smaller stuff, some bugfixes and some new bugs, and of course regular merging with Linus.

The most important change for current users is the objects database SHA1 keys change and (comparatively minor) directory cache format change. This makes "pulling up" from older revisions rather difficult. Linus' instructions _should_ work for you too, basically (you should replace cat .git/HEAD with cat .git/heads/* or equivalent - note that convert-tree does not accept multiple arguments so you need to invoke it multiple times), but I didn't test it well (I did it the lowlevel way completely since I needed to simultaneously merge with Linus).

But if you can't be bothered by this or fear touching stuff like that, and you do not have any local commits in your tree (it would be pretty strange if you had and still fear), just fetch the tarball (which is preferable to git init for me since it eats up a _significantly_ smaller portion of my bandwidth).

I had to release git-pasky-0.6.1 since Linus changed the directory cache format while I was releasing git-pasky-0.6. And git-pasky-0.6.2 fixes a script missing from the list of scripts to install.

So, now for the heads-up part. We will undergo at least two major changes now. First, I'll probably make git-pasky use the directory cache for the add/rm queues now that we have diff-cache.

Second, I've decided to straighten up the naming now that we still have a chance. There will be no git-pasky-0.7, sorry. You'll get cogito-0.7 instead. I've decided for it since after some consideration having it named differently is the right thing (tm).

The short command version will change from 'git' to 'cg', which should be shorter to type and free the 'git' command for possible eventual entry gate for the git commands (so that they are more namespace-friendly, and it might make most sense anyway if we get fully libgitized; but this is more of long-term ideas).

The usage changes:

cg patch -> cg mkpatch ('patch' is the program which _applies_ it)

cg apply -> cg patch (by analogy with diff | patch)

cg pull will now always only pull, never merge.

cg update will do pull + merge.

cg track will either just set the default for cg update if you pass it no parameters, or disappear altogether; I think it could default to the 'origin' branch (or 'master' branch for non-master branches if no 'origin' branch is around), and I'd rather set up some "cg admin" where you could set all this stuff - from this to e.g. the committer details [*1*]. You likely don't need to change the default every day.

I must say that I'm pretty happy with the Cogito's command set otherwise, though. I actually think it has now (almost?) all commands it needs, and it is not too likely that (many) more will be added - simple means easy to use, which is Cogito's goal. Compare with the command set of GNU arch clones. ;-)

[*1*] The committer details in .git would override the environment variables to discourage people from trying to alter them based on whatever, since that's not what they are supposed to do. They can always just change the .git stuff if they _really_ need to.

Comments welcomed, as well as new ideas. Persuading me to change what I sketched here will need some good arguments, though. ;-)

Greg KH and others pointed out that cg was already claimed by other programs. At one point, Linus Torvalds remarked:

I realize that there is probably a law that there has to be a space, but I actually personally use tab-completion all the time, and in many ways prefer a name that can be completed without having to play games with magic bash completion files.

So how about using a dash instead of a space, and making things be


etc? You can link them all to the same script if you don't like having multiple scripts, and just match with

        case "$0" in

or something.

Yeah, yeah, it looks different from "cvs update", but dammit, wouldn't it be cool to just write "cg-<tab><tab>" and see the command choices? Or "cg-up<tab>" and get cg-update done for you..

Just because rcs/cvs/everybody-and-his-dog thinks it is cool to have a space there and have different meaning for flags depending on whether they are before the command or after the command doesn't mean that they are necessarily right..
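Linus's `case "$0"` idea amounts to one script linked under many names, dispatching on the name it was invoked under. A sketch (the command names and messages are made up for illustration; the real script would link cg-update, cg-log, etc. to one file and switch on `$(basename "$0")`):

```shell
#!/bin/sh
# One script linked under many names; dispatch on the invoked name.
# Shown as a function taking the name explicitly so it is easy to test;
# the real script would call: dispatch "$(basename "$0")"
dispatch() {
    case "$1" in
        cg-update) echo "pull + merge" ;;
        cg-log)    echo "show history" ;;
        *)         echo "unknown command: $1" >&2; return 1 ;;
    esac
}
dispatch cg-update    # prints "pull + merge"
dispatch cg-log       # prints "show history"
```

The dash-separated names are what make "cg-up&lt;tab&gt;" completion work without any bash completion machinery.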

Petr replied:

I like this idea! :-) I guess that is in fact exactly what I have been looking for, and (as probably apparent from the current git-pasky structure) I prefer to have the scripts separated anyway.

I think I will go for it. I also thought about having this _and_ a 'cg' command which would act as a completely dumb multiplexer, but I decided to toss that idea since it would only create usage ambiguity and other problems in the long run.

Greg also liked this idea.

Along a completely different track, Linus posted a Cogito patch to alter the functioning of 'git log', such that "If you redirect the output to a non-tty, both "less" and "more" do the right thing and just feed the output straight through. But if the output is a tty, this makes "git log" a lot more friendly than a quickly scrolling mess.." Daniel Jacobowitz thought this idea was completely unintuitive, but Linus said:

There is _never_ any valid situation where you do "cg-log" with unpaginated output to a tty.

In _any_ real system you'll be getting thousands of lines of output. Possibly millions. unpaginated? What the hell are you talking about?

And as I pointed out, if the output is not a tty, then both less and more automatically turn into cat, so there's no difference. This change _only_ triggers for a tty, and I dare you to show a single relevant project where it's ok to scroll thousands of lines.

Even git-pasky, which is just a two-week-old project, right now outputs 4338 lines of output to "git log".

Unpaginated? You must be kidding.

(But if you are _that_ fast a reader, then hey, use "PAGER=cat", and you'll find yourself happy).
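The behavior being argued about — paginate only when stdout is a terminal — comes down to a single `test -t 1`. A minimal sketch, with `generate_log` as a stand-in for the real log output:

```shell
#!/bin/sh
# Pipe through a pager only when stdout is a tty; when redirected,
# less/more would act like cat anyway, so we can skip them entirely.
generate_log() { seq 5000; }    # stand-in for the real cg-log output
show_log() {
    if [ -t 1 ]; then
        generate_log | ${PAGER:-less}
    else
        generate_log
    fi
}
```

Run interactively, `show_log` pages; `show_log > file` writes straight through; and `PAGER=cat show_log` opts out entirely, as Linus suggests.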

30. wit 0.0.4 Released

20 Apr 2005 (1 post) Subject: "wit 0.0.4 uploaded"

Topics: wit

People: Christian Meder

Christian Meder said:

I uploaded a new snapshot to

The changes are


If I weren't so tired I'd write something cool and nifty for sure

The demo site is at

31. Vesta Proposed As git Alternative

20 Apr 2005 (1 post) Subject: "Similarities with Vesta"

Topics: BitKeeper, Darcs, Vesta, diff-tree

People: Kenneth C. Schalk

Kenneth C. Schalk said:

When I first read about git a few days ago, I was pretty surprised by how similar it seemed to the project I've been working on for several years. (Coming off BK, I would have expected a more Darcs/Codeville kind of change-merging-centric approach.)

Since I don't see any mention of Vesta on this list or lkml, I thought I should at least join the list and say a few words. Some of the similarities which jumped out at me:

Since this isn't entirely on topic, I'll refrain from going into too much detail, but I'd be happy to answer any questions. (There's also lots of information on, multiple published papers, and a book from Springer-Verlag coming out sometime later this year.) I'm hoping that there's some opportunities for the kernel developers to get some benefit from the considerable research and development that's gone into Vesta.

A very brief history of Vesta:

Lastly, since I know git's development has been focused on performance, here's a quick example of checking in, patching, and diffing the kernel source in Vesta:

# Create a branch to use for the 2.6.0 sources.

% vbranch -O /vesta/
Creating branch /vesta/
% vcheckout /vesta/
Reserving version /vesta/
Creating session /vesta/
Making working directory /vesta-work/ken/kernel

# Unpack the 2.6.0 source into the working directory.

% cd /vesta-work/ken/kernel
% tar -jxf /tmp/kernel/linux-2.6.0.tar.bz2
31.90s user 7.98s system 46% cpu 1:24.92 total

# Move everything up out of the directory with the version name just
# to simplify the rest of the example.

% mv linux-2.6.0/* .
% rmdir linux-2.6.0

# Take an immutable snapshot of the working copy.  This is basically
# hashing the file contents and taking a snapshot of the directory
# structure.  (The real work happens inside the repository server,
# which is running in the background.)

% vadvance
Advancing to /vesta/
0.75s user 1.08s system 4% cpu 39.371 total

# Check in the snapshot.  (This is a very fast operation, because all
# it's really doing is giving a new name to the snapshot taken in the
# previous step.)

% cd ..
% vcheckin kernel
Checking in /vesta/
Deleting /vesta-work/ken/kernel
0.10s user 0.07s system 33% cpu 0.510 total

# Make a new branch for 2.6.1 based on the 2.6.0 version just checked
# in.  (Again, no real work here.)

% vbranch -o /vesta/ \
Creating branch /vesta/
0.01s user 0.01s system 27% cpu 0.073 total

# Apply the 2.6.1 patch.

% cd /vesta-work/ken/kernel
% bzcat /tmp/kernel/patch-2.6.1.bz2 | patch -p1
bzcat  0.70s user 0.03s system 91% cpu 0.796 total
patch  0.74s user 1.52s system 22% cpu 10.216 total

# Take a snapshot of the changes and check them in.

% vadvance
Advancing to /vesta/
0.33s user 0.38s system 22% cpu 3.092 total
% cd ..
% vcheckin kernel
Checking in /vesta/
Deleting /vesta-work/ken/kernel
0.05s user 0.03s system 41% cpu 0.193 total

# Repeat just for some more timing numbers.

% vbranch -o /vesta/ \
Creating branch /vesta/
0.01s user 0.01s system 167% cpu 0.012 total
% vcheckout /vesta/
Reserving version /vesta/
Creating session /vesta/
Making working directory /vesta-work/ken/kernel
0.01s user 0.00s system 62% cpu 0.016 total
% cd /vesta-work/ken/kernel
% bzcat /tmp/kernel/patch-2.6.2.bz2 | patch -p1
bzcat  1.89s user 0.06s system 92% cpu 2.107 total
patch  1.67s user 3.21s system 19% cpu 24.807 total
% vadvance
Advancing to /vesta/
0.38s user 0.47s system 18% cpu 4.559 total
% cd ..
% vcheckin kernel
Checking in /vesta/
Deleting /vesta-work/ken/kernel
0.07s user 0.03s system 38% cpu 0.260 total

# Let's see how fast my quick diff-tree clone is.

% cd /vesta/
% 2.6.0/1 2.6.1/1
< 2.6.0/1/CREDITS
> 2.6.1/1/CREDITS
< 2.6.0/1/Documentation/DocBook/kernel-locking.tmpl
> 2.6.1/1/Documentation/DocBook/kernel-locking.tmpl
< 2.6.0/1/sound/sound_core.c
> 2.6.1/1/sound/sound_core.c
< 2.6.0/1/usr/gen_init_cpio.c
> 2.6.1/1/usr/gen_init_cpio.c
1.50s user 0.44s system 50% cpu 3.871 total
% 2.6.1/1 2.6.2/1
1.70s user 0.47s system 47% cpu 4.526 total
% 2.6.0/1 2.6.2/1
1.85s user 0.56s system 48% cpu 4.935 total

# /vesta is a mount point for a virtual filesystem provided by the
# repository server.  All the versions created above can be accessed
# directly, so you can use normal diff to generate patches.

% diff -Nru 2.6.0/1 2.6.1/1 > /tmp/kernel/patch-2.6.1-recreated
0.79s user 2.00s system 39% cpu 7.149 total
% diff -Nru 2.6.1/1 2.6.2/1 > /tmp/kernel/patch-2.6.2-recreated
1.40s user 2.51s system 38% cpu 10.070 total

Just a couple final notes on this demonstration:

Thanks for your time.

32. Linus's Ideas For Tagging

21 Apr 2005 - 24 Apr 2005 (57 posts) Subject: "Re: Git-commits mailing list feed."

Topics: BitKeeper, SHA1, fsck-cache

People: Linus Torvalds

Linus Torvalds said:

The reason I've not done tags yet is that I haven't decided how to do them.

The git-pasky "just remember the tag name" approach certainly works, but I was literally thinking of setting up some signing system, so that a tag doesn't just say "commit 1da177e4c3f41524e886b7f1b8a0c1fc7321cac2 is v2.6.12-rc2", but it would actually give stronger guarantees, ie it would say "Linus says that commit 1da177e4c3f41524e886b7f1b8a0c1fc7321cac2 is his 2.6.12-rc2 release".

That's something fundamentally more powerful, and it's also something that I actually can integrate better into git.

In other words, I actually want to create "tag objects", the same way we have "commit objects". A tag object points to a commit object, but in addition it contains the tag name _and_ the digital signature of whoever created the tag.

Then you just distribute these tag objects along with all the other objects, and fsck-cache can pick them up even without any other knowledge, but normally you'd actually point to them some other way too, ie you could have the ".git/tags/xxx" files have the pointers, but now they are _validated_ pointers.

That was my plan, at least. But I haven't set up any signature generation thing, and this really isn't my area of expertise any more. But my _plan_ literally was to have the tag object look a lot like a commit object, but instead of pointing to the tree and the commit parents, it would point to the commit you are tagging. Something like

        commit a2755a80f40e5794ddc20e00f781af9d6320fafb
        tag v2.6.12-rc3
        signer Linus Torvalds

        This is my official original 2.6.12-rc2 release

        -----BEGIN PGP SIGNATURE-----
        -----END PGP SIGNATURE-----

with a few fixed headers and then a place for free-form commentary, everything signed by the key (and then it ends up being encapsulated as an object with the object type "tag", and SHA1-csummed and compressed, ie it ends up being just another object as far as git is concerned, but now it's an object that tells you about _trust_)

(The "signer" field is just a way to easily figure out which public key to check the signature against, so that you don't have to try them all. Or something. My point being that I know what I want, but because I normally don't actually ever _use_ PGP etc, I don't know the scripts to create these, so I've been punting on it all).

If somebody writes a script to generate the above kind of thing (and tells me how to validate it), I'll do the rest, and start tagging things properly. Oh, and make sure the above sounds sane (ie if somebody has a better idea for how to more easily identify how to find the public key to check against, please speak up).
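What Linus is asking for can be roughed out with standard tools. This sketch is purely illustrative (not what git later shipped): it builds the tag body with the fixed headers, notes where gpg would clearsign it (commented out, since it needs a key), and then wraps the result git-style — a type-and-length header, named by the SHA1 of the uncompressed object. The real object header uses a NUL after the length; a newline is used here to keep the sketch shell-friendly.

```shell
#!/bin/sh
# Illustrative tag-object creation: fixed headers, free-form comment,
# stored as just another object named by its (uncompressed) SHA1.
body='commit a2755a80f40e5794ddc20e00f781af9d6320fafb
tag v2.6.12-rc3
signer Linus Torvalds

This is my official original 2.6.12-rc2 release'
# body=$(printf '%s' "$body" | gpg --clearsign)   # would attach the signature
len=$(printf '%s' "$body" | wc -c)
sha=$(printf 'tag %s\n%s' "$len" "$body" | sha1sum | cut -d' ' -f1)
mkdir -p tag-objects
printf 'tag %s\n%s' "$len" "$body" | gzip -9 > "tag-objects/$sha"
```

Verification would be `gpg --verify` on the decompressed body, with the "signer" header telling you which public key to try first.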

He added:

Btw, in case it wasn't clear, one of the advantages of this is that these objects are really _not_ versioned themselves, and that they are totally independent of the objects that they actually tag.

They spread together with all the other objects, so they fit very well into the whole git infrastructure, but the real commit objects don't have any linkages to the tag and the tag objects themselves don't have any history amongst themselves, so you can create a tag at any (later) time, and it doesn't actually change the commit in any way or affect other tags in any way.

In particular, many different people can tag the same commit, and they don't even need to tag their _own_ commit - you can use these tag objects to show that you trust somebody else's commit. You can also throw the tag objects away, since nothing else depends on them and they have nothing linking to them - so you can make a "one-time" tag object that you can pass off to somebody else, and then delete it, and now it's just a "temporary tag" that tells the recipient _something_ about the commit you tagged, but that doesn't stay around in the archive.

That's important, because I actually want to have the ability for people who want me to pull from their archive to send me a message that says "pull from this archive, and btw, here's the tag that not only tells you which head to merge, but also proves that it was me who created it".

Will we use this? Maybe not. Quite frankly, I think human trust is much more important than automated trust through some technical means, but I think it's good to have the _support_ for this kind of trust mechanism built into the system. And I think it's a good way for distributors etc to say: "this is the source code we used to build the kernel that we released, and we tagged it 'v2.6.11-mm6-crazy-fixes-3.96'".

And if my key gets stolen, I can re-generate all the tags (from my archive of tags that I trust), and sign them with a new key, and revoke the trust of my old key. This is why it's important that tags don't have interdependencies, they are just a one-way "this key trusts that release and calls it xyzzy".

In the course of discussion, Jan Harkes pointed out that it would be a difficult proposition for users to fetch tags from a remote repository. But Linus said:

this is a _feature_.

Other people normally shouldn't be interested in your tags. I think it's a mistake to make everybody care.

So you normally would fetch only tags you _know_ about. For example, one of the reasons we've been _avoiding_ personal tags in the BK trees is that it just gets really ugly really quickly because they get percolated up to everybody else. That means that in a BK tree, you can't sanely use tags for "private" stuff, like telling somebody else "please sync with this tag".

So having the tag in the object database means that fsck etc will notice these things, and can build up a list of tags you know about. It also means that you can have tag-aware synchronization tools, ie exactly the kind of tools that only grab missing commits can also then be used to select missing tags according to some _private_ understanding of what tags you might want to find..

33. Adding Pathname Prefixes To Files In checkout-cache

21 Apr 2005 (1 post) Subject: ""checkout-cache" update"

Topics: checkout-cache

People: Linus Torvalds

Linus Torvalds said:

I just pushed out this very useful thing to "checkout-cache", which is best just described by its commit log:

Add the ability to prefix something to the pathname to "checkout-cache.c"

This basically makes it trivial to use checkout-cache as a "export as tree" function. Just read the desired tree into the index, and do a

checkout-cache --prefix=export-dir/ -a

and checkout-cache will "export" the cache into the specified directory.

NOTE! The final "/" is important. The exported name is literally just prefixed with the specified string, so you can also do something like

checkout-cache --prefix=.merged- Makefile

to check out the currently cached copy of "Makefile" into the file ".merged-Makefile".

Basically, I can do a "git-0.6" release with a simple

checkout-cache --prefix=../git-0.6/ -a

which basically says: check out all files, but use the prefix "../git-0.6/" before the filename when you do so.

Then I just do

cd ..
tar czvf git-0.6.tar.gz git-0.6

and I'm done. Very cool, very simple, and _extremely_ fast.

Doing the tree export (not the tar) for the whole kernel takes two minutes in the cold-cache case (not so wonderful, but acceptable), and 4.6 _seconds_ in the hot-cache case (pretty damn impressive, I say).

(The compressing tar then takes about 20 seconds for me, and that's obviously all from the cache, since I just wrote it out).

NOTE! The fact that the '/' at the end of the --prefix= thing is meaningful can be very confusing, I freely admit. But it does end up being potentially quite useful, and you're likely to script usage of this anyway into "git export" or something, so...

34. New GIT_INDEX_FILE Environment Variable; Status Of show-diff

21 Apr 2005 - 22 Apr 2005 (14 posts) Subject: ""GIT_INDEX_FILE" environment variable"

Topics: Compression, SHA1, checkout-cache, diff-cache, diff-tree, read-tree, show-diff

People: Linus Torvalds, Davide Libenzi, Petr Baudis

Linus Torvalds said:

This checkin goes along with the previous one, and makes it easier to use all the normal git operations on temporary index files:

Add support for a "GIT_INDEX_FILE" environment variable.

We use that to specify alternative index files, which can be useful if you want to (for example) generate a temporary index file to do some specific operation that you don't want to mess up your main one with.

It defaults to the regular ".git/index" if it hasn't been specified.

and it's particularly useful for doing things like "read a tree into a temporary index file, and write the result out". For example, say that you wanted to know what the Makefile looked like in a particular release, you could do

    GIT_INDEX_FILE=.tmp-index read-tree $release
    GIT_INDEX_FILE=.tmp-index checkout-cache --prefix=old- Makefile
    rm .tmp-index

and you're done. Your old Makefile version is now in "old-Makefile" (and this is also where it's nice that checkout-cache refuses to overwrite existing files by default: if you forgot or messed up the prefix, it's all good).

You can also use it to test merges without screwing up your old index file in case something goes wrong.

Did I already happen to mention that I think that the git model is the best model ever, and that I'm not just an incredibly good-looking hunk and becomingly modest, I'm smart too?

Davide Libenzi replied, "You forgot, *again*, to take your medications!!"
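The GIT_INDEX_FILE lookup Linus describes is a plain default-with-override pattern; a two-line sketch (illustrative — git's actual source does this in C via getenv):

```shell
#!/bin/sh
# Resolve the index path: honor GIT_INDEX_FILE if set, else the default.
index_path() { echo "${GIT_INDEX_FILE:-.git/index}"; }
# e.g. GIT_INDEX_FILE=.tmp-index read-tree $release  operates on .tmp-index
```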

In the course of discussion, Petr Baudis remarked, "Note that Cogito almost actually does not use show-diff anymore. I'm doing diff-cache now, since that is what matters to me." And Linus said:

Indeed. "diff-tree" (between releases) and "diff-cache" (between a release and the current state) are clearly much more fundamental operations.

Also, they have absolutely zero policy, and they're designed to be used with the same scripting engines (ie hopefully you can use just one tool to show the output of either in whatever format you want).

They show you what the canonical names and associated information is, and that's it. What you _do_ with them ends up being outside the scope of git, exactly like it should be. Details like "what format of diff" to produce should be left to the tools around it.

In contrast, "show-diff" was _literally_ written to check what the difference between the "real" file and a "sha1" file was, back when I couldn't write the sha1 files correctly (ie I corrupted anything that didn't fit in the first "[un]compression block", and then calling "diff" to show the difference between the original and the regenerated data was very important).

So "show-diff" just kind of expanded from an early debugging tool to something that _almost_ looks like a real tool. But it's absolutely the right thing to use "diff-tree" and "diff-cache" instead.

35. git-pasky (Cogito) 0.6.3 Released

21 Apr 2005 - 22 Apr 2005 (10 posts) Subject: "[ANNOUNCE] git-pasky-0.6.3 && request for testing"

Topics: diff-cache, diff-tree, read-tree, show-diff

People: Petr Baudis, Greg KH, Linus Torvalds

Petr Baudis said:

I've released git-pasky-0.6.3 earlier in the night. It especially brings plenty of bugfixes, but also some tiny enhancements, like colored log output and the ability to pick a branch in the remote repository. git log and git patch now also accept a range of commits, so e.g. if you do

git patch linus:this

you should get a sequence of patches (commit message + patch, with delimiters between patches) which will bring you from linus to your current HEAD. Of course the package is in sync with Linus' branch.

Get it at

or pull (it should work fine, no format changes).

Not yet released are changes I made later tonight, which change git-pasky's usage of the directory cache - it will record adds/removals to it and use diff-cache instead of show-diff to check for any differences. The code is much simpler, but likely some small bugs were introduced in the process - please report any problems you hit, and test heavily. What is known is that you cannot diff specific files now.

Greg KH found an interesting bug. As he described:

go into a kernel git tree.

    rm Makefile
    git diff

Watch it as it thinks that every Makefile in the kernel tree is now gone...

Petr Baudis identified this as a bug in git's diff-cache code, and Linus Torvalds replied:

Nice find.

Yes, I told you guys I hadn't tested it well ;)

"diff-cache" does the same "diff trees in lockstep" thing that "diff-tree" does, but it's actually more complex, since the _tree_ part always needs to be recursively followed, while the _cache_ part is this linear list that is already expanded.

Which just made the whole algorithm very messy.

Once I found out how nasty it was to do that compare, I was actually planning to re-write the thing using the same approach that "read-tree -m <tree>" does - ie move the tree information _into_ the in-memory cache, at which point it should be absolutely trivial to compare the two. But since the horrid algorithm seemed to end up working, I never did.

I'm not even going to debug this bug. I'm just going to rewrite diff-cache to do what I should have done originally, ie use the power of the in-memory cache. That's also automatically going to properly warn about unmerged files.

Give me five minutes ;)

Within an hour, he posted a fix.

36. Optimized SHA1 Calculation For Git On PowerPC Systems

21 Apr 2005 (1 post) Subject: "[PATCH] optimized SHA1 for powerpc"

Topics: SHA1, fsck-cache

People: Paul Mackerras

Paul Mackerras said, "Just for fun, I wrote a ppc-assembly SHA1 routine. It appears to be about 2.5x faster than the generic version. It reduces the time for a fsck-cache on a linux-2.6 tree from ~6.8 seconds to ~6.0 seconds on my G4 powerbook."

37. Cogito Debian Package

22 Apr 2005 (4 posts) Subject: "[PATCH] git-pasky debian dir"

Joshua T. Corbin produced a Debian package of Cogito (git-pasky) and made it available at, or as a git tree at rsync:// Chris Wright offered some suggestions, but there was no real discussion.

38. Cogito Hosted On

22 Apr 2005 (2 posts) Subject: "[FYI] Cogito rsync/download location moved"

People: Petr Baudis

Petr Baudis said:

I'm happy to announce that the Cogito rsync location changed to


Please update your .git/remotes accordingly.

Also please note that the Cogito download location changed too. From now on, Cogito releases will appear at


Please update your bookmarks, if you have any.

This will hopefully keep me within my bandwidth limit until the end of the month, and should make things significantly faster for you. Thanks a lot to the folks who made it possible!

39. Using git To Create Redundant Mail Servers

22 Apr 2005 - 23 Apr 2005 (5 posts) Subject: "Git for redundant mail servers"

People: David Woodhouse, Jon Seymour, David Lang

David Woodhouse suggested:

Random alternative use for git... we could use it to provide a cluster of redundant mail delivery/storage servers.

The principle is simple; you use something like a set of Maildir folders, stored in a git repository. Any action on the mail storage is done as a commit -- that includes delivery of new mail, or user actions from the IMAP server such as changing flags, deleting or moving mail. These actions are actually fairly efficient when Maildir folders are stored in a git repository -- the IMAP model is that mails are immutable, and flag changes are done as renames.

In the normal case where all the servers are online, each commit is immediately pushed to each remote server. When a server is offline or separated somehow from the rest of the group, it's going to have to do a merge when it reconnects -- we'd implement a Maildir-specific merge algorithm, which really isn't that hard to do.

In this case we'd probably want to make active use of the feature of git which allows you to prune history. You don't need to keep any history further back than the commit which will be the common ancestor when a currently-absent member of the cluster eventually comes back. In the common case, that will actually be no history at all, since all members will be present.

You can then have multiple members of a cluster, each running an SMTP server and allowing for delivery of email, and each running an IMAP server. Clients can connect to any of the machines and receive IMAP service, and email will continue to flow inward, as long as at least one machine in the cluster remains alive.
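A minimal sketch of the scheme using modern git syntax (the Maildir paths and message name are invented): every mailstore action is one commit, and an IMAP flag change is a pure rename, so the message blob itself is shared untouched between commits.

```shell
#!/bin/sh
# Sketch only: Maildir-in-git as David proposes, one commit per action.
set -e
store=$(mktemp -d); cd "$store"
git init -q
mkdir -p INBOX/cur
printf 'Subject: hello\n\nbody\n' > INBOX/cur/112233.host:2,
git add -A
git -c user.name=mda -c user.email=mda@example.com commit -qm 'deliver 112233'
# Setting the "Seen" flag in Maildir is just a rename (append S):
mv INBOX/cur/112233.host:2, INBOX/cur/112233.host:2,S
git add -A
git -c user.name=imapd -c user.email=imapd@example.com commit -qm 'flag 112233 +Seen'
# Both commits reference the identical, untouched message blob:
git rev-parse 'HEAD:INBOX/cur/112233.host:2,S' 'HEAD^:INBOX/cur/112233.host:2,'
```

The last command prints the same object name twice, which is exactly why flag changes replicate cheaply: only tree objects differ between the two commits.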

Jon Seymour replied, "This is a cool idea. When the concept is rendered this way, it sounds a lot like some of the core principles in the architecture of the Lotus Notes replication engine. I've always thought it would be cool to have an open engine that provided similar functionality to the Lotus Notes replication engine without the naff programming environment that sits on top. I can see how the git concepts and code could provide the basis of such a solution. Very cool." David Lang replied:

Having been in several discussions on the cyrus mailing list about replication let me point out a couple basic problems that you have to work around.

  1. When a new message arrives, it gets given a numeric message id. This message id is not supposed to change without fairly drastic things happening (the server telling all clients to forget everything they know about the status of the mailbox). This requires synchronization between servers if both are receiving messages.
  2. git effectively stores snapshots of things, and you deduce the changes by comparing the snapshots. For things like flags changing, this is a relatively inefficient way to replicate changes (although if one server is offline for a while it could be a fairly efficient way to do the merge).

and now a couple of starting points

Cyrus already implements single-instance store, so the concept of the same message living in multiple places doesn't have to be grafted in. It keeps the message flags separate from the messages themselves, so the messages could be replicated separately from the state.

Personally, I'm not seeing git being a huge advantage for this, but I do see some advantages, and it's very possible I'm missing some others.

go for it.

David W. said, "We don't have to stick _precisely_ to Maildir -- but flag changes are just a rename in Maildir, leaving the mail object entirely intact while changing only the tree. That isn't _so_ bad; but yes, it could probably be done a little better than just "Maildir in git"." He added, regarding David L.'s point 1:

Yeah, that's the most interesting part. One option would be to require quorum before a server is allowed to add to a mailbox -- but that would render the thing unsuitable for _intentional_ offline use, where you want to be able to move mails from one folder to another on your laptop while it's disconnected.

Since it should be relatively rare for 'competing' commits to occur during periods of disconnection, I suspect that the solution doesn't have to be particularly efficient. I'm not sure I'd really want to change UIDVALIDITY if it happened, but perhaps we could simply remove _all_ the affected UIDs, and assign new UIDs to the same mails.

In practice, it's far more important for us to ensure that an existing UID _never_ refers to a different mail than it is to make sure that a given mail always keeps the same UID.

40. Respect File Permissions During Merge

23 Apr 2005 - 24 Apr 2005 (14 posts) Subject: "[PATCH] make file merging respect permissions"

People: James Bottomley

James Bottomley said:

I noticed when playing about with merging that executable scripts lose their permissions after a merge. This was because unpack-file just unpacks to whatever the current umask is.

I also noticed that the merge-one-file-script thinks that if the file has been removed in both branches, then it should simply be removed. This isn't correct. The file could have been renamed to something different in both branches, in which case we have an unflagged rename conflict.

The attached patch fixes both issues.
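The umask problem is easy to demonstrate in any git tree with modern commands (the file names here are invented): the executable bit lives in the index, but a helper that writes the merge result with plain file I/O gets only what the umask allows.

```shell
#!/bin/sh
# Sketch of the bug James describes, using today's command names.
set -e
repo=$(mktemp -d); cd "$repo"
git init -q
printf '#!/bin/sh\necho ok\n' > run.sh
chmod 755 run.sh
git add run.sh
git ls-files --stage run.sh        # mode 100755 is recorded in the index
umask 022
echo 'merged result' > run.sh.merged   # a plain write never gets that mode
ls -l run.sh.merged | cut -c1-10       # -rw-r--r--: the x bit is gone
```

A merge helper therefore has to reapply the index's recorded mode after unpacking, which is what the fix amounts to.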

He posted a patch, and after some patch-corruption issues caused by James editing the patches by hand, Linus Torvalds accepted the fixes into his tree.

41. git-pasky 0.7 Released; Official Rename To Cogito Imminent

23 Apr 2005 - 24 Apr 2005 (17 posts) Subject: "[ANNOUNCE] git-pasky-0.7"

Topics: Compression, checkout-cache, diff-cache, read-tree, show-diff, update-cache

People: Petr Baudis

Petr Baudis said:

this is the last release of git-pasky, my SCMish layer over Linus' git tree history storage tool. The next releases will be called 'cogito' and will feature a significantly reworked user interface (finally). Get git-pasky-0.7 at


You can also pull, but you may not want to do that unless you know you will be able to recover from possible inconsistencies (if you have no local changes, read-tree $(tree-id) && checkout-cache -f -a && update-cache --refresh should do it). The pulling/merging tools in older versions contain bugs which _might_ affect this pull.

The biggest change is in the way the directory cache is used (this is an internal thing, nothing user-visible except fewer bugs). Now that we have diff-cache, git-pasky uses that instead of show-diff, and drops the add/rm queues. This also makes the diffs coming from git diff more consistent-looking.

To pick randomly from the other changes - older zlib compatibility, always use bash, git patch output changes/fixes, git log timezone fix, plenty of bugfixes and of course merges with Linus. Thanks to all the contributors!

42. Cogito Tutorial; Big User Interface Changes Imminent

23 Apr 2005 (3 posts) Subject: "A very basic tutorial"

People: James Purser, David A. Wheeler

James Purser said:

I've done a very basic tutorial for using git. If someone who knows more about git (pretty much everyone on this list I think) could have a look and point out any glaring errors that would be great.

David A. Wheeler replied:

You mention two layers, but the layers are about to get (pun?) separate names: git (lower layer) and Cogito (upper layer, and separable from the "real" git lower layer). You should explain the two names.

You concentrate on Cogito, which is almost certainly reasonable for normal starting points, but that's not clear in your text.

Also: The Cogito user interface is about to undergo a significant rename of the commands. It won't be hard to update your text for it, but you'll need to do so.

James said he'd update the tutorial.

43. convert-cache Enhancements

23 Apr 2005 - 24 Apr 2005 (3 posts) Subject: "Old "sparse" archive converted.."

Topics: SHA1, convert-cache, fsck-cache, read-tree, update-cache

People: Linus Torvalds

Linus Torvalds said:

Ok, I expanded on "convert-cache" quite a bit, to the point where it could actually re-write "commit" and "tree" objects entirely, at which point it now became possible to convert the broken old sparse archive that had the wrong date format in its commit objects, and had the old flat-file format for the "tree" objects.

I updated the tree at

to be this modern-format tree that fsck-cache no longer complains about.

If anybody is keeping an old-format tree around, I'll just warn you that I'm not going to necessarily guarantee that git will be able to read it in the future, so you should replace it with the modern one (which should just be a matter of either deleting the old one and getting a new copy, or you could do a

        convert-cache $(cat .git/HEAD)
        echo ..result.. > .git/HEAD
        read-tree $(cat .git/HEAD)
        update-cache --refresh
        fsck-cache --unreachable $(cat .git/HEAD)

which _should_ get you a clean tree with a top commit with the name f765413f020fd2c97c22716320dc96b33cda7c43 if everything went right.

I'm considering turning on SHA1 validation by default when reading objects, just because it's the right thing to do from a "find any fs corruption early" angle, and it should be fairly cheap. That will make old pre-conversion trees no longer work.

Jeff Garzik asked Linus to copy his work to /pub/scm/linux/kernel/git/torvalds/sparse.git, and Linus did so.

44. Renaming All Core git Commands

23 Apr 2005 - 26 Apr 2005 (22 posts) Subject: "Re: [GIT PATCH] Selective diff-tree"

Topics: diff-cache, diff-tree, show-diff

People: Linus Torvalds, Junio C. Hamano

In the course of discussion, Linus Torvalds said, "at some point we really should prepend "git-" to all the git commands. I didn't want to do the extra typing when I started out and was unsure about the name, but hey, by now we really should." He asked Junio C. Hamano what he thought about this. Junio apparently didn't see this part of Linus's post, and a few posts down the line, asked:

how about renaming show-diff to diff-file?

    diff-tree  : compares two trees.
    diff-cache : compares a tree and the cache, or a tree and files.
    diff-file  : compares the cache and files.

Linus replied, "Yes. Except I think that the "big renaming" is coming up, and we should just rename them to have a "git-" prefix too."
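The renaming eventually happened much as Junio sketched it: in today's git the three comparisons are git diff-tree, git diff-index (the old diff-cache) and git diff-files (the old show-diff/diff-file). A quick demo with invented contents:

```shell
#!/bin/sh
# Junio's three comparisons under their eventual "git-" names.
set -e
repo=$(mktemp -d); cd "$repo"
git init -q
echo one > file
git add file
git -c user.name=demo -c user.email=demo@example.com commit -qm c1
echo two > file
git diff-files --stat                 # the index vs. the working files
git add file
git diff-index --cached --stat HEAD   # a tree (HEAD) vs. the index
```

After the second git add, diff-files reports nothing while diff-index still shows the staged change, which is exactly the division of labor Junio's table describes.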

45. gitweb Updated

24 Apr 2005 (1 post) Subject: "/git/gitweb.html"

People: Kay Sievers

Kay Sievers said:

We have a new improved version of the git web-interface:

You can download it from:

gitweb now exports a simple RSS feed: :)

gitweb now accepts a whole path instead of only a name to a project. Example:

46. darcs Able To Pull From git Repositories

24 Apr 2005 - 26 Apr 2005 (7 posts) Subject: "A darcs that can pull from git"

Topics: Darcs

People: Juliusz Chroboczek, David Roundy, Linus Torvalds

Juliusz Chroboczek said:

I've just finished putting together a hack for darcs to allow it to pull from Git repositories. You'll find the patch (Darcs patch, not diff patch) on

You should get yourself a copy of darcs-unstable, then apply this patch:

  $ darcs get darcs-git
  $ cd darcs-git
  $ darcs apply darcs-git-20050424.darcs
  $ make darcs

If you get merge conflicts, try using a version of the darcs-unstable tree from 18.04.2005, which is what I started with.

A minor problem: there's something broken with the build procedure; you'll probably need to manually do a ``make Context.hs'' followed by ``make darcs'' when the build breaks.

After you build darcs-git, you should be able to do something like

  $ cd ..
  $ mkdir a
  $ cd a
  $ darcs initialize
  $ ../darcs-git/darcs pull /usr/local/src/git-pasky-0.4
  $ darcs changes

This version can *pull* from git, but it cannot push; in other words, the only way to export your data from Darcs back to git is to use diff and patch.

Please be aware that this is just a proof-of-concept prototype. David and the rest of the Central Committee haven't looked at this code yet; it is quite likely that future versions of Darcs will generate completely different patches from git repositories. It is also likely that THIS CODE WILL EAT YOUR DATA.

The major issue is that we generate no patch dependencies. If you try to cherry-pick from repositories generated with this version, you'd better know what you're doing.

David, could you please have a look at the patches

  Sun Apr 24 16:50:02 CEST 2005  Juliusz Chroboczek <>
    * First cut at remodularising repo access.
  Sun Apr 24 16:01:32 CEST 2005  Juliusz Chroboczek <>
    * Change Repository to DarcsRepo.

and tell me whether this sort of restructuring is okay with you.

(David, I'm not claiming that this scheme is better than the ``tagging like crazy'' scheme that you outlined; I'm only trying to prove that my scheme is workable.)

Right now, I'm taking a Git commit and manually generating a Darcs patch id from that, which is a bad idea. A better way would be to get Darcs to deal with arbitrarily shaped patch ids; a patch that originates with git would get the git patch id, while a patch that comes from Darcs would retain its patch id even when pushed to git. David, you had some objections to that; any chance we could discuss the issue?

This is slow. There are a few obvious improvements to make to the performance, but I'd rather first implement whatsnew, diff and apply, and fix the problem with patch dependencies. (Whatsnew is where git's performance is actually likely to be better than Darcs, but it will require some abstracting of ``Slurpy'' in order to make that effective.) Unfortunately, I don't expect to have hacking time before next week-end.

David Roundy (the darcs maintainer) was very happy to see this, and said:

First off, you need to include a license header in the git files indicating that unlike the rest of darcs, they may only be distributed under GPL v2. Something like the following would probably be fine (but it's Linus' copyright that's involved, not mine)

 * GIT - The information manager from hell
 * Copyright (C) Linus Torvalds, 2005

  This program is free software; you can redistribute it and/or modify
  it under the terms of version 2 of the GNU General Public License as
  published by the Free Software Foundation.

  This program is distributed in the hope that it will be useful,
  but WITHOUT ANY WARRANTY; without even the implied warranty of
  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
  GNU General Public License for more details.

  You should have received a copy of the GNU General Public License
  along with this program; if not, write to the Free Software Foundation,
  Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.

Without this header, it's either illegal to distribute these files, or they're assumed to be under GPLv2 or later along with the rest of darcs, which also isn't legal...

Juliusz said, "Linus, could you please suggest a suitable license statement to include in whichever files of yours we choose to include in Darcs? Is David's suggestion (stock GPL boilerplate with "or any later version" removed) okay with you?" Linus Torvalds replied, "Stock GNU boilerplate without the "or any later version" works fine. As does a simple one-liner "Licensed under GPLv2", for that matter. It's not like there can be any real confusion."

47. New 'Bit' Version Control System Based On git

24 Apr 2005 - 25 Apr 2005 (6 posts) Subject: "[FILE] GNU BIT"

Topics: SHA1, checkout-cache, read-tree

People: Andreas Gal, Petr Baudis

Andreas Gal said:

BIT - a little bit like SCM

BIT is a training exercise in shell programming and is the result of my attempts to wrap my head around GIT's inner workings. BIT's command line interface should be very familiar to anyone who has worked with other(tm) SCM tools before. I try not to depend on custom GIT features. BIT uses the off-the-shelf GIT core tools distributed by Linus. This means that BIT has about 2% of the features of Cogito. Also, it has about 0.1% of the user base of Cogito, so it's probably very broken and I really don't recommend using it.

You can obtain BIT from the following GIT repository:

You can use the GIT core utilities to pull and check-out BIT:

    curl > .git/HEAD
    http-pull -a `cat .git/HEAD`
    read-tree `cat .git/HEAD`
    checkout-cache -f -a

Naturally, you can also use BIT to pull the current sources, which is much simpler:

bit clone

This will create a directory "bit", pull the latest sources and perform a check-out for you.


Put "bit" anywhere in your search path. It's only a single bash script. It requires a link "bit-resolve" to itself, and the (soft) link should reside in the same directory as "bit" itself. "bit" acts as a merge script when invoked through that link.

At this point, BIT's functionality is minimal. It does what I need it for. I will obviously add more commands as we go along, but I won't touch things like tags and stuff like that until Linus makes up his mind how to do it *RIGHT*.



$ bit clone

(Note: Don't forget the '/index.html' at the end, otherwise http-pull won't work.)

This command pulls Linus' latest GIT tree to a local repository "git.git". You can change the latter by giving clone an additional argument.

$ bit clone git-trunk

Once you have a copy of the remote repository, you can check whether there are new changesets in the remote repository that you haven't seen yet:

$ bit changes -R

If you see any changes, you can merge them into your own tree:

$ bit pull


Now let's assume you want to work on an extension to GIT. For this, we will clone the repository:

$ bit clone git-trunk git-bit

This will create a copy of the git-trunk repository and name it "git-bit". The object directory is shared (using a soft link), which has the nice benefit that once you run "bit pull" on one of the repositories, the other one will be able to merge changes without any network traffic (except for reading the current HEAD).

Let's say we make some changes to sha1_file.c and want to commit them to our local repository "git-bit":

... edit sha1_file.c ...

In case we already forgot what file we edited, "bit pending" will tell us:

$ bit pending

Just in case we can still remember what we changed, there is "bit diffs", which shows a diff against the current HEAD or any other version of our tree.

$ bit diffs
--- 28ad1598e54200ca8ee1261ed7beb4e31e20b2f1/sha1_file.c
+++ sha1_file.c
@@ -70,6 +70,7 @@
        int i;
        static char *name, *base;

+       /* added a cool new feature here */
        if (!base) {
                char *sha1_file_directory = getenv(DB_ENVIRONMENT) ? : ...
                int len = strlen(sha1_file_directory);

To commit our changes, we use "bit commit". It will fire up "vi" to ask for a commit message.

$ bit commit

... enter commit message in vi ...


Let's assume Linus put out a new version of GIT, so we want to update both of our repositories. First let's do this for "git-trunk".

$ cd git-trunk
$ bit pull

(Note: You have to specify the URL explicitly every time because there is no consensus yet on where to store this information. Once that's sorted out, this will be automatic, of course.)

As this repository only tracks Linus' sources, there should be no conflicts.

Now let's go to our "git-bit" repository and do the same there:

$ cd git-bit
$ bit pull

Because both repositories share the object directory, you will get away with minimal network traffic. Conflicts are resolved using RCS merge. If that fails, you have to edit the offending files yourself.


Let's assume we want to send our improvements to Linus. For this, we can ask changes to show us all local changes in our repository:

$ bit changes -L

There is currently no mechanism in BIT to generate patches automatically, but I will add one shortly. What is working already is that you can push your repository to a remote location:

$ bit push ssh://

This will update the remote repository via SSH and set HEAD to point to your latest version. Please note that you have to create a repository at the remote location using "init-db".


Try "bit --help" to get some simple instructions on how to use BIT. All commands have builtin help as well. Try "bit commit --help". Not all listed options are implemented yet. Feel free to send me a patch.

Petr Baudis asked, "Do you intend to licence it as free software? Also, you call it "GNU BIT". Does that mean it is part of the GNU project?" Andreas replied, "bit contains various pieces and snippets from Linus' and your scripts and thus automatically falls under the GPL. Maybe I should have stated that somewhere explicitly, but it's an 800-line bash script for crying out loud. The "GNU" part was actually more like a joke. If you feel offended by it, I am happy to remove it. Are there any trademark issues involved in using "GNU"?" Petr said the license was worth noting explicitly, and added, "I think it's better not to use "GNU" in the name if it's not part of the GNU project."

48. Tag Object Implemented In git

25 Apr 2005 - 27 Apr 2005 (14 posts) Subject: "git "tag" objects implemented - and a re-done commit"

Topics: SHA1, fsck, git-mktag

People: Linus Torvalds, H. Peter Anvin, Petr Baudis, Andreas Gal

Linus Torvalds said:

Ok, I just pushed out my "tag" object implementation, and due to some local braindamage over here, I ended up re-doing one commit, so if you happened to pull my 'git' tree at _just_ the right time, you will have a commit object named 06a02346f6a2e9ff113c189629ff7148f5141bb0 in your git repository, which is not exactly bogus, but which I ended up undoing.

So if you've been pulling my git stuff, check your "git log" for whether you find that commit in your stuff. If you do, I guess it doesn't much matter (ie should all merge in cleanly), but if you want to match my tree, you should first undo it if it's your HEAD commit (by setting your HEAD to the _parent_ of that commit, and then running the git-prune-script thing).

Anyway, I decided that my original model for tags was the right one, with a trivial extension. Notably, if you want to tag a single file or a tree object, go wild. The tag object format is:

        object <sha1>
        type <type>
        tag <tag>
        .. free-form commentary and signature of this all ..

and the "git-mktag" program verifies that the three first lines are valid before it accepts it and writes it as a git object.

Right now the tags don't do anything, except fsck can verify them (not the signature - git doesn't even specify any particular format, and you may validly have unsigned tags in your tree), and will print out something like

tagged commit e83c5163316f89bfbde7d9ab23ca2e25604af290 (v2.6.12-rc2)

if you were to have such a tag-object in your object database (you don't, because I've not generated one, but hey..)

H. Peter Anvin said, "It would be good if the tag object could permit junk lines before the start of the header", but Linus said, "No, I've already explained why git doesn't parse arbitrary junk: I want git to have 100% repeatable behaviour. And that very much means that if git doesn't understand something, it just doesn't touch it or parse it." Petr Baudis asked, "Could we please at least maintain the newline between the "header" and data, like in the commit objects?" And Linus said, "Yes, I did that in the "git-tag-script" I actually committed, although git doesn't currently really care (ie fsck won't complain)."
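The format Linus describes survives almost unchanged in today's git, which added only a "tagger" line to the header. A sketch with invented contents, using the modern porcelain to create the object and cat-file to inspect it:

```shell
#!/bin/sh
# Create an annotated tag and print the raw tag object.
set -e
repo=$(mktemp -d); cd "$repo"
git init -q
echo x > f
git add f
git -c user.name=demo -c user.email=demo@example.com commit -qm c1
git -c user.name=demo -c user.email=demo@example.com tag -a -m 'test tag' v0.1
git cat-file -p v0.1   # object/type/tag(/tagger) header, then free-form text
```

The last command shows the same object, type, and tag lines that git-mktag verifies, followed by the free-form commentary.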

Elsewhere, Andreas Gal asked, "Are tag objects referenced by trees (and thus limited in scope) or are they stand-alone entities in the repository? The latter would be bad for shared object storage. Also, if I delete and recreate tags, will the old tag remain in the tree or will the file in the object storage disappear? So far all objects were always persistent, which is a nice property to have." Linus replied:

They are totally stand-alone, and I don't see why that would be bad for shared object storage.

In fact, the whole point of them is that since they are named by the SHA1 hash of the content, there are no shared object issues. Two different tags by two different developers will have different names, exactly the same way two different releases will have different names.

And if two different developers tag exactly the same object with exactly the same tag-name and exactly the same signature, then they get the same tag object, and that's fine. They should.

He added, "Tags in no way affect the entry they name. When you remove - or add - a tag object, nothing happens to anything else." Andreas replied:
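The deduplication Linus describes falls directly out of content addressing, and it can be demonstrated with modern git outside any repository (the content here is invented):

```shell
#!/bin/sh
# Identical bytes always hash to the identical object name, so two
# developers writing the same tag content create the same object.
a=$(printf 'any content at all' | git hash-object --stdin)
b=$(printf 'any content at all' | git hash-object --stdin)
test "$a" = "$b" && echo "same object name: $a"
```

Conversely, any difference in signer, tag name, or commentary changes the bytes and therefore the object name, which is why two developers' tags normally never collide.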

If tags are standalone objects, then I don't see how they get propagated. Right now all I need to do to pull a version from a remote repository is to get the commit object and everything it depends on. Any tags involved would not be pulled, as there are no dependencies from the commit object to the tag. That's why I was asking whether they are part of any tree or not.

So how do we want to do this? Maybe a file TAGS right next to HEAD that lists all active tags in my tree by SHA1 hash? Or maybe make tags a linked list, with all tags referring to some parent tag? That would give a nice list to walk back to find older tags (again, we use something like TAG as root).

Linus replied:

You propagate them "by hand" (which eventually obviously means "with tools to do so").

The thing is, you _shouldn't_ be interested in my tags unless I -tell- you to be interested in them.

So I'll probably just push out my tags with my archives, and then people can verify them if they want to.

A few minutes later, he said, "Ok, for the intrepid users, you can now test to see if you can pick them out. fsck should make them totally obvious, and here's my public key in case you also want to verify the things. Of course, since I normally don't use pgp signing etc, it's entirely possible that I've done something stupid, and I'm now sending you my secret key and my full porn-collection." He posted his public key:

Version: GnuPG v1.2.4 (GNU/Linux)


49. git-aware Darcs Repository Available

25 Apr 2005 (1 post) Subject: "Git-aware darcs: gettable repo"

Topics: Darcs, git init, git pull

People: Juliusz Chroboczek

Juliusz Chroboczek said:

Just to let you know that, thanks to some friendly tagging by Ian Lynagh, I've been able to set up a gettable Darcs repository of the Git-aware version of Darcs.

If you're on a Linux system with darcs, ghc 6.2, libz, libcurl and libssl, you should be able to do

  $ darcs get --partial
  $ cd darcs-git
  $ make darcs
  $ make Context.hs
  $ make darcs
  $ mv darcs ~/bin/darcs-git
  $ cd ..
  $ mkdir linux
  $ cd linux
  $ darcs-git initialize
  $ darcs-git pull /usr/local/src/linux-2.6

and see the OOM killer in action.

50. Repository For The r8169 Kernel Driver

25 Apr 2005 (1 post) Subject: "[rft] repository for the r8169 driver"

People: Francois Romieu

Francois Romieu said:

A repo for the r8169 stuff is set up at :


It is (supposedly) derived from a recent 2.6.12-git repo. I'd appreciate if someone could report whether it's usable or not.

51. Cogito 0.8 Released With New Name, Major UI Changes

25 Apr 2005 - 26 Apr 2005 (15 posts) Subject: "[ANNOUNCE] Cogito-0.8 (former git-pasky, big changes!)"

Topics: cg-pull, git fork

People: Petr Baudis

Petr Baudis said:

here goes Cogito-0.8, my SCMish layer over Linus Torvalds' git tree history tracker. This package was formerly called git-pasky; however, this release brings big changes. The usage is significantly different, as well as some basic concepts; the history changed again (hopefully for the last time?) because of fixing dates of some old commits. The .git/ directory layout changed too.

Upgrading through pull is possible, but rather difficult and requires some intimacy with git, git-pasky and Cogito. So probably the best way to go is to just get the cogito-0.8 tarball at


build and install it, and do

cg-clone rsync://

Yes, this is a huge change. No, I don't expect any further changes of similar scale. I think the new interface is significantly simpler _and_ cleaner than the old one.

First, the concept changes. There is no concept of tracking anymore; you just do either cg-pull to only fetch the changes, or cg-update to fetch them as well as merge them into your working tree. An even more significant change is that Cogito does not directly support local branches anymore - git fork is gone; you just go to a new directory and do

cg-init ~/path/to/your/original/repository

(or cg-clone, which will try to create a new subdirectory for itself). This now acts as a separate repository, except that it is hardlinked with the original one; therefore you get no additional disk usage. To get new changes to it from the original repository, you have to cg-update origin. If you decide you want to merge back, go to the original repository, add your new one as a branch and pull/update from it.
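
The hardlinking trick Petr describes can be sketched in a few lines of Python (a hypothetical illustration, not Cogito's actual code): cloning a store of immutable object files by hardlinking each one shares the inodes, so the clone costs essentially no extra disk space.

```python
import os, tempfile

def hardlink_clone(src_dir, dst_dir):
    """Clone a directory of immutable object files by hardlinking each file.
    An append-only object store never rewrites files, so sharing inodes is safe."""
    for root, _dirs, files in os.walk(src_dir):
        rel = os.path.relpath(root, src_dir)
        target = os.path.join(dst_dir, rel)
        os.makedirs(target, exist_ok=True)
        for name in files:
            os.link(os.path.join(root, name), os.path.join(target, name))

# demo in a scratch directory
base = tempfile.mkdtemp()
src = os.path.join(base, "orig", "objects")
os.makedirs(src)
with open(os.path.join(src, "blob1"), "w") as f:
    f.write("object data")
dst = os.path.join(base, "clone")
hardlink_clone(os.path.join(base, "orig"), dst)

# both names point at the same inode: no additional disk usage
st = os.stat(os.path.join(dst, "objects", "blob1"))
print(st.st_nlink)  # 2
```

(Hardlinks only work within one filesystem, which is why this applies to local clones.)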

As for the interface changes, you will probably find out on your own; cg-help should be of some help. All the scripts now start with 'cg-', and you should ignore the 'cg-X*' ones. The non-trivial mapping is:

        git addremote -> cg-branch-add
        git lsremote -> cg-branch-ls
        git patch -> cg-mkpatch
        git apply -> cg-patch
        git lsobj -> cg-admin-lsobj

Commands that are gone:

        git fork
        git track

New commands:


Of course other changes include various bugfixes, and latest Linus' stuff (although we do not make use of Linus' tags yet).

Note that I don't know how much time I will have for hacking on Cogito until next Sunday/Monday. I hope I will get some time to at least apply bugfixes etc., but I don't know how much more I will be able to do. You would make me a happy man if you could please port your pending patches from git-pasky to Cogito; I promise to apply them, and I hope there isn't going to be another change this big in the foreseeable future, which would cause major conflicts for your patches etc.

Note that I cc'd LKML since this is going to break stuff for anyone using git-pasky now (apologies for that; it won't happen again). Please try not to keep it in the cc list unless it is really relevant.

Mike Taht suggested that with such big changes going in, it would probably be a good time to reorganize some directories. Petr replied:

Actually, I've been thinking about this, but I think we just don't need it *yet*.

And by the time we need to make things more hierarchical, we will hopefully have some way to deal with renames sensibly. We need something for that too - either something ultra-smart as Linus describes, or explicit renames - but merging not working across renames makes them a total nightmare.

52. Cogito 0.8.0 Debian Package

25 Apr 2005 (1 post) Subject: "[PATCH] cogito debian dir"

People: Joshua T. Corbin

Joshua T. Corbin said, "Debianization of cogito 0.8.0, binary .deb can be downloaded from:"

53. Cogito Tutorial

26 Apr 2005 - 27 Apr 2005 (11 posts) Subject: "Cogito Tutorial If It Helps"

People: James Purser, Benjamin Herrenschmidt

James Purser said:

I reworked the previous tutorial to take in the changes in the scripts. Will make this a series of tutorials to cover all aspects. Any suggestions or hints or spelling corrections would be most welcome.

Petr Baudis was thrilled to see this, but Benjamin Herrenschmidt complained, "this tutorial is exactly like cogito's own readme as far as I'm concerned: it just makes things even more confusing to me. I must be really stupid, I should stick to hacking the kernel and not try to use userland tools :)" But after banging his head against it for a while, Benjamin started figuring things out.

54. git-Related Web Site

26 Apr 2005 (1 post) Subject: "Website"

People: Denny Schierz

Denny Schierz said:

my name is Denny Schierz and we (at the moment four German people) want to give something back to you and the community. So we decided to create a website ( We're hoping that we can help build an interface between the developers of git and the users, for feedback, wishes, tutorials, etc.

At the moment, we're structuring our ideas on how to build the page (with the typo3 CMS), including:

and more

55. git Header Parsing; Some Explanation Of Design Philosophy

26 Apr 2005 - 28 Apr 2005 (29 posts) Subject: "A shortcoming of the git repo format"

Topics: BitKeeper, CVS, SHA1, fsck

People: Linus Torvalds, Tom Lord, H. Peter Anvin

In the course of discussion, some folks complained that the git headers were too arbitrary, and it was difficult to know how to parse them. There was some speculation on what to do for this, and Linus Torvalds said:

For a "commit", the format is

There is no free-format text _anywhere_ that git parses. No room for guesses, no room for mistakes, no room for anything half-way questionable.

And fsck actually enforces this. We do _not_ just use "gets()" to read one line at a time. We literally verify that the lines are 46/48 bytes long, and have the delimiters in the expected places.
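
The 46/48-byte figures come from the fixed header lines: "tree " plus a 40-character hex SHA1 plus newline is 46 bytes, and "parent " plus SHA1 plus newline is 48. A sketch of that kind of check in Python (illustrative only, not git's actual fsck code):

```python
import re

# "tree " (5 bytes) + 40 hex digits + newline = 46 bytes
# "parent " (7 bytes) + 40 hex digits + newline = 48 bytes
def check_commit_header(lines):
    assert re.fullmatch(r"tree [0-9a-f]{40}\n", lines[0]) and len(lines[0]) == 46
    for line in lines[1:]:
        assert re.fullmatch(r"parent [0-9a-f]{40}\n", line) and len(line) == 48

check_commit_header(["tree " + "a" * 40 + "\n",
                     "parent " + "b" * 40 + "\n"])
print("header ok")
```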

Same goes for "tree" and "tag" objects. They all have fixed-format stuff. A "tree" entry is always

"%o <space> %s" \0 [ 20 bytes of sha1 ]

with "%o" being "mode", and "%s" being "path". We don't guess.

And this really is _important_. Exactly because we name things by the SHA1 hash of the contents, we MUST NOT have flexible formats. Having a format which allows non-canonical representations (extra spaces etc) would mean that two trees that were identical would depend on how you happened to format them.

So there's really two issues:

For example, another rule is that a "tree" object is always sorted by the bytes in the filename (not by entry, btw: a directory called "foo" will sort as "foo/", even though the _entry_ only shows "foo"). That rule not only makes a lot of operations faster, but again, it means that there is only _one_ way to represent a tree validly.

IOW, you _cannot_ represent a tree any other way (and I've been too lazy to check this in fsck, but it's always been my plan), and that is exactly why we can just compare the hashes of the results - because there is no random component of "layout" in the contents.

This really is important. It means that if you get to the same two tree contents in totally unrelated ways (you unpack a tar-file and encode it in git, or you have 5 years of git history and check it out), the "tree" will match _exactly_. There's no history. There's no "optional" stuff. Since the contents of the trees are the same, the SHA1 of the two trees will be the same. Exactly because git refuses to touch any free-format stuff.
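
Both points can be illustrated with a short Python sketch: the object name is the SHA1 of a canonical byte string, and tree entries have exactly one valid order. (The blob header shown matches the format git hashes; the entry tuples are a simplification of real tree entries.)

```python
import hashlib

# 1. Content addressing: an object's name is the SHA1 of a canonical byte
#    string, "<type> <size>\0" + payload, so identical content always gets
#    an identical name, no matter how it was produced.
payload = b"hello\n"
stored = b"blob %d\x00" % len(payload) + payload
print(hashlib.sha1(stored).hexdigest())
# ce013625030ba8dba906f756967f9e9ca394464a  (what `git hash-object` prints)

# 2. Canonical tree order: entries sort by bytes, with a directory sorting
#    as if its name ended in "/" (entries here are (name, is_dir) pairs).
def tree_sort_key(entry):
    name, is_dir = entry
    return name + "/" if is_dir else name

entries = [("foo", True), ("foo.c", False), ("foo-bar", False)]
print([n for n, _ in sorted(entries, key=tree_sort_key)])
# ['foo-bar', 'foo.c', 'foo']  ('-' 0x2d < '.' 0x2e < '/' 0x2f)
```

A naive byte sort of the bare names would instead put "foo" first, producing a second, non-canonical layout for the same tree.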

H. Peter Anvin felt that such an unchanging format was brittle, and Tom Lord of the Arch project piped up, "I think one has to understand Linus' posts as coming from the "head-down, steaming ahead for *MY* project cause you all suck" perspective and impose corresponding filters on his declarations of "LAW". At least that's the only way *I* can make sense of his latest contributions." H. Peter said, "It's fair for Linus to want to make things behave a certain way in a project. There are design decisions which have tradeoffs both ways -- robust (but subject to partial information issues) versus brittle (but safe)." And Linus said to Tom:

I'm really being very head-strong on these things, and much more so than I normally am, because quite frankly, I see "git" as a very different project from Linux.

(Which is not to say that I'm not opinionated even normally, but I'm normally a bit more open to listen to other people ;)

There's two huge differences between git and Linux, and I'm really sorry if they make me act as an asshole, but they are important to me:

In other words: I _know_ that I've been unpleasant. I'm sorry about that, but I am trying to explain _why_ I'm being an asshole about things, more so than I usually am.

I'm not actually all that interested in SCM's. I'd have been much happier if I never had to start doing git in the first place. But circumstances not only forced me to do my own, it also so happens that I don't believe that there are many people around that have ever really _seen_ what my kind of development requirements are.

What does that boil down to? It means, for example, that to me it doesn't matter one _whit_ if you've been doing SCM's for the last thirty years, and you can do xdelta algorithms in your sleep.

Quite the reverse: such a person "knows" a lot of things, but I'm pretty damn sure that such a person has _never_ actually worked on a system that works the way the kernel development does, which means that most of the things that person "knows" are things that may need to be un-learnt.

And because I don't actually _care_ about SCM's, and only care about getting to the point where I (once more) don't have to even think about the SCM that I use for the kernel, I also don't have much incentive to worry about CM models that may well be very valid outside of kernel work.

See? When it comes to my Linux work, I'm very inclusive. Linux already does everything _I_ need it to do, so in many ways, all that really motivates me to improve it is other people's needs, and as such, I'm really really interested in what _other_ people want. I still say "no, that's not how we do things", but that's much less contentious.

In contrast, with git, I'm totally uninterested in anything that doesn't make my kernel work go faster or more smoothly, and does so _today_. Which makes me a cantankerous old bastard, and I bite the heads off anybody who isn't focused on that one thing.

And I really _am_ sorry. I don't actually _like_ being nasty about these things. But when it comes to git, I have one motivation, and one motivation only, and being nice about it isn't going to help.

The good news? I actually think my needs are very basic. Once git gets to the point where it does what I need it to do, I don't really have any motivation to say "this is how we do it" any more. And I think we're actually getting to that point fairly soon. That's not saying git is "done", any more than Linux was "done" in 1992. It's just that at that point I don't have any reason to be a nasty control freak any more.

In fact, I don't see myself even maintaining the project, especially since there seem to be others that are more motivated to do so than I am. Then I'll just go back into my dark kernel cave, and hopefully I don't have to come out again for a while.

But for now, the _only_ point of git is as a kernel maintenance tool. There are tons of other SCM systems that are probably better for other projects, so if git is "just another SCM project", then git is totally pointless. So for now, the absolutely _only_ thing that matters for git design (as far as I'm concerned) is "how well does it suit Linus".

56. Identifying Changesets Affecting A Given File

27 Apr 2005 - 28 Apr 2005 (25 posts) Subject: "Finding file revisions"

Topics: FS: ext2, FS: ext3, diff-tree

People: Chris Mason, Linus Torvalds

Chris Mason said:

I haven't seen a tool yet to find which changeset modified a given file, so I whipped up something. The basic idea is to:

for each changeset in rev-list
        for each file in diff-tree -r parent changeset
                match against desired files

Is there a faster way? This will scale pretty badly as the tree grows, but I usually only want to search back a few months in the history. So, it might make sense to limit the results by date or commit/tag.

file-changes [-c commit id] file1 ...

The file names can be perl regular expressions, and it will match any file starting with the expression listed. So "file-changes fs/ext" will show everything in ext2 and ext3.

Example output:

diff-tree -r 56022b4d00cae3ff816d3ff05d9f8a80e1517c60
cat-file commit 9bd104d712d710d53c35166e40bd5fe24caf893e
    tree cd4e40eae003e29c0d3be2aa769c3b572ab1b488
    parent 56022b4d00cae3ff816d3ff05d9f8a80e1517c60
    author mason <mason@coffee> 1114617717 -0400
    committer mason <mason@coffee> 1114617717 -0400

    comments go here

This is meant for cut n' paste. If you find a changeset comment you like, run the diff-tree -r command on the first line to see a diff of the changeset (maybe I should add | diff-tree-helper here?)

Linus Torvalds felt there was a faster method. He said:

Tell "diff-tree" what your desired files are, and it will cut down the amount of work by a _lot_ (because then diff-tree doesn't need to recurse into subdirectories that don't matter).

So you should just do

        for each changeset in rev-list
                diff-tree -r parent changeset <file-list>


Chris and others worked on an implementation, arguing a bit over how best to go about it.
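
Linus's loop can be sketched against a toy in-memory history (a hypothetical model, not the real tools: a "tree" here is just a path-to-content dict, and the path restriction plays the role of the <file-list> argument, so uninteresting paths are never even compared):

```python
# Toy model: a "tree" is a {path: content} dict; a history is a list of
# (commit_id, tree) pairs, oldest first. Passing the path list into
# diff_tree mirrors `diff-tree -r parent changeset <file-list>`.
def diff_tree(old, new, paths=None):
    keep = (lambda p: any(p.startswith(pre) for pre in paths)) if paths else (lambda p: True)
    return {p for p in set(old) | set(new)
            if keep(p) and old.get(p) != new.get(p)}

history = [
    ("c1", {"fs/ext2/inode.c": "v1", "mm/memory.c": "v1"}),
    ("c2", {"fs/ext2/inode.c": "v2", "mm/memory.c": "v1"}),
    ("c3", {"fs/ext2/inode.c": "v2", "mm/memory.c": "v2"}),
]

# walk the rev-list newest first, diffing each commit against its parent
hits = []
for i in range(len(history) - 1, -1, -1):
    cid, tree = history[i]
    parent_tree = history[i - 1][1] if i > 0 else {}
    if diff_tree(parent_tree, tree, ["fs/ext"]):
        hits.append(cid)
print(hits)  # ['c2', 'c1'] -- c3 only touched mm/, so it is skipped
```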

57. Tracking File Changes With diff-files

27 Apr 2005 - 28 Apr 2005 (6 posts) Subject: "[PATCH] add a diff-files command"

Topics: diff-cache, diff-files, diff-tree, show-diff

People: Nicolas Pitre

Nicolas Pitre said:

In the same spirit as diff-tree and diff-cache, here is a diff-files command that processes differences between the index cache and the working directory content. It produces lists of files that are either changed (-c), deleted (-d) or outside (-o) of the current cache, or a combination of those, or all of them (-a).

The -p option can also be used to generate a patch describing the changes directly.

It also has the ability to accept exclude file patterns with -x and even a file containing a list of patterns to exclude with -X. This is especially useful to use the famous dontdiff file when looking for uncommitted files in a compiled kernel tree.

Junio C. Hamano felt that the same functionality could be achieved by grepping through the show-diff output, and that therefore diff-files was not needed. Nicolas replied:

show-diff doesn't handle files in the work tree which are not listed in the cache.

Have you ever looked at the dontdiff file? You can get a sample of it from to give you an idea. Using grep or filterdiff is really backward in that case since, out of all the junk that might appear in the output, about 98% will be filtered away in most useful cases, which is rather inefficient.

Path restriction is inclusive, while the exclude list is, well, exclusive. They serve separate purposes. So trust me, it _is_ pretty damn useful, unless you always run "make clean" on your kernel tree before checking for potentially uncommitted files and then recompile everything afterwards, which is a hassle.

Junio just didn't see any point to this feature. But Nicolas said, "I don't do this out of pure enthusiasm but rather trying to make my own workflow with the Linux kernel source tree more efficient in the context of git usage." The two were not able to agree during the thread.
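
The exclude-pattern behaviour under discussion can be sketched with shell globs in Python (illustrative only: the sample dontdiff patterns are assumed, and whether the real diff-files matched the basename or the whole path is not settled in the thread):

```python
import fnmatch

def excluded(path, patterns):
    """A file is dropped if its basename matches any shell glob in the
    exclude list (the rough equivalent of -x / -X in the thread)."""
    name = path.rsplit("/", 1)[-1]
    return any(fnmatch.fnmatch(name, pat) for pat in patterns)

# a dontdiff-style list (assumed contents, for illustration)
dontdiff = ["*.o", "*.a", "*.orig", "vmlinux", ".config"]

# files present in the work tree but not in the index cache
worktree_only = ["drivers/net/r8169.o", "fs/ext2/balloc.c",
                 "vmlinux", "Documentation/new-note.txt"]
print([p for p in worktree_only if not excluded(p, dontdiff)])
# ['fs/ext2/balloc.c', 'Documentation/new-note.txt']
```

The point Nicolas makes is visible here: in a compiled tree most "other" files are build junk, so filtering them out before printing beats piping everything through grep afterwards.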

58. Importing An mbox Full Of Patches

27 Apr 2005 (3 posts) Subject: "import mbox?"

Topics: dotest

People: Linus Torvalds

Jeff Garzik asked if there were a script he could use to grab all the patches out of an mbox file, and apply them to his local tree. Linus Torvalds replied:

I've got a "tools" project at

which has my old "dotest" scripts re-written for git (yeah, yeah, three years and two generations later, I still call the damn thing "dotest". I'm not very good at this tool naming thing ;)
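
The mbox-splitting half of such a script is easy to sketch with Python's standard mailbox module (a hypothetical stand-in for the splitting step of a dotest-style driver, which would then apply each numbered patch in turn):

```python
import mailbox, os, tempfile

# Build a tiny two-message mbox as a stand-in for a mailbox of patches
path = os.path.join(tempfile.mkdtemp(), "patches.mbox")
with open(path, "w") as f:
    f.write("From a@example.org Thu Apr 28 00:00:00 2005\n"
            "Subject: [PATCH 1/2] first\n\npatch body one\n\n"
            "From b@example.org Thu Apr 28 00:01:00 2005\n"
            "Subject: [PATCH 2/2] second\n\npatch body two\n")

# Split it into numbered files, one per message, in mailbox order
outdir = os.path.dirname(path)
for i, msg in enumerate(mailbox.mbox(path), 1):
    with open(os.path.join(outdir, "%04d" % i), "w") as out:
        out.write(msg.get_payload())

print(sorted(os.listdir(outdir)))
```

A real driver would loop over the numbered files and apply each one, committing as it goes.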

59. Developers Wrestling With The Removal Of 'git fork'

27 Apr 2005 - 28 Apr 2005 (5 posts) Subject: "Git fork removal?"

Topics: cg-diff, cg-pull, git fork

People: Daniel Barkalow, Petr Baudis

Daniel Barkalow asked perplexedly, "I saw that "fork" was removed when going to the cg- scripts, and the replacements don't do the symlinked trees thing. I found the symlinked trees thing vital to my workflow, so I'm going to want to reintroduce them, or something similar. Is there some reason you went to hardlinked object files instead of symlinked directories?" Petr Baudis replied:

The user. ;-)

Apparently, too many people were confused by the local/remote branches distinction, and even I gradually ceased to like it (BTW, Cogito still supports working with them - it just does not offer any interface for manipulating them). The current scheme is much simpler and, I believe, clearer.

Also, the forked repositories were not truly independent - people actually got burnt by forking and then removing the original repository.

If this breaks your workflow, could you please describe it? Perhaps we could find a good semantics to support both.

Daniel replied:

The part that I'm worried about is the way I turn a mass of debugging and little local commits into a clean patch series. I've got a working fork "barkalow", which is the result of a bunch of stuff and a dozen commits. It is derived from "linus". I want to split up the changes and make a series of commits, each of which will be a patch to submit.

  1. I fork "linus" into "for-linus". I go into "for-linus".
  2. I do "git diff this:barkalow > patch". This gives me the complete set of changes I want to submit.
  3. I cut down the diff to a single logical change by removing all of the other hunks.
  4. I do "git apply < patch". I do "git commit". I describe the logical change.
  5. I go back to step 2, unless I'm done.
  6. For each of the commits between "linus" and "for-linus", I do "git patch <commit>", and send out the result.

The thing that I think requires the symlinks is step 2, which requires that there be somewhere I can run git and have it able to see a pair of unrelated local heads and the relevant trees.

Petr replied:

Just do cg-pull barkalow, to get the latest changes from that repository (perhaps clone should inherit branches information?).

But if you want Linus to pull from your tree, you generally want it to be clean - that is, you want to manage clean separation (as Pavel Machek describes in his document). That is another advantage of hardlinking - you don't get any unrelated stuff in if you don't explicitly pull it, so you can keep your for-linus branch clean. I'd do cg-diff linus:this in the barkalow branch instead to keep this property.

Daniel was surprised to learn that it was possible to pull from a local repository; he said this seemed like it would be confusing to users. He also added that Petr's solution "doesn't work; when I'm preparing the second patch in the series, I want to compare linus+patch 1 against barkalow, so that I'm looking at what's left to split. That's why I need to have the unrelated heads, not just the linus head and my head based on it. If I go back to linus each time, it's more work making the patches and I don't have an easy way of telling whether I've included the same part twice or missed a part."

The thread ended here.

60. gitweb On; Sorting Commits

27 Apr 2005 - 28 Apr 2005 (23 posts) Subject: " now has gitweb installed"

Topics: SHA1, convert-cache

People: David Woodhouse, Petr Baudis, Linus Torvalds, H. Peter Anvin

H. Peter Anvin gave a pointer to git repositories on; Daniel Jacobowitz was thrilled to see this, but David Woodhouse remarked, "Looks like the ordering is wrong. A chronological sort means that commits which were made three weeks ago, but which Linus only pulled yesterday, do not show up at the top of the tree." Petr Baudis replied graphically:

  Linus                     ASM (Anonymous Subsystem Maintainer)

   A|                        |B
    |                        |
    |                        \-------------\
    |                        :             |
    \------------------------\             |E
   C|                        |D            |
    |                        /-------------/
    |                        |F

How would you show that? F E D C B A? F D C A E B?

David replied:

Let us assume that C and A were already in Linus' tree (and on our web page) yesterday. Thus, they should be last. The newly-pulled stuff should be first -- FEDBCA.

I'd say "depth-first, remote parent first" but that would actually show 'A' (as a parent of D) long before it shows C. Walking of remote parents should stop as soon as we hit a commit which was accessible through a more local parent, rather than as soon as we hit a commit which we've already printed. Maybe it should be something like depth-first, local parent first, but _reversed_?

The latter is what the mailing list feeder does, but that has the advantage of being able to use 'rev-tree $today ^$yesterday' so we _know_ we're excluding the ones people have already seen. Hence I haven't really paid that much attention to getting the order strictly correct.

(Yes, I know that strictly speaking, git has no concept of 'remote' or 'local' parents. But the ordering of the two parents in a Cogito merge or pull hasn't changed, has it?)

He went on:

Walk the tree once. For each commit, count the number of _children_. That's not hard -- each new commit you find below HEAD has one child to start with, then you increment that figure by one each time you find another path to the same commit.

When printing, you walk the tree depth-first, remote-parent-first. If you hit a commit with multiple children, decrement its count by one. If the count is still non-zero, ignore that commit (and its parents) and continue. If the count _is_ zero, then this is the "most local" path to the commit in question, so print it and continue to process its parents...

(Actually I'd probably do it by adding real pointers to the children instead of using a counter. Operations like convert-cache would be far better off working that way round, and 'cg comments' is going to need to do something very similar to convert-cache.)
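
David's two-pass scheme can be sketched in Python on (roughly) Petr's example graph, with a final merge M added as the local head. The parent lists are written remote-first, and the whole graph is hand-made for illustration:

```python
# Hand-made graph: id -> parent list, remote parent first.
# M is Linus pulling the ASM branch F while sitting on his own commit C.
parents = {
    "M": ["F", "C"],
    "F": ["E", "D"], "E": ["B"], "D": ["A", "B"],
    "C": ["A"], "B": [], "A": [],
}

def print_order(head):
    # pass 1: count children (the number of paths arriving at each commit)
    children, stack, seen = {}, [head], set()
    while stack:
        c = stack.pop()
        if c in seen:
            continue
        seen.add(c)
        for p in parents[c]:
            children[p] = children.get(p, 0) + 1
            stack.append(p)
    # pass 2: depth-first, remote parent first; a commit is emitted only
    # when its last remaining (i.e. most local) path reaches it
    out = []
    def walk(c):
        out.append(c)
        for p in parents[c]:
            children[p] -= 1
            if children[p] == 0:
                walk(p)
    walk(head)
    return out

print("".join(print_order("M")[1:]))  # FEDBCA -- the order David wanted
```

When the walk first reaches A via D's remote side, A's count is still non-zero, so it is held back until the local path through C reaches it; that is the "most local path" rule in action.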

Linus Torvalds replied:

No, that really sucks.

Realize that "remote" and "local" parents don't really exist. They have no meaning. I've considered sorting the parents by the sha1 name, but I've left that for now.

Anyway, the reason remote and local don't matter is that if somebody else merges with me, and I just pull the result without having any changes in my tree, we just "fast-forward" to that other side, because otherwise you can never "converge" on anything (people merging each others trees would always create a new commit, for no good reason).

What does that mean? It means that my local tree now became the _remote_ parent, even though it was always local to my tree.

So if you look at remote vs local, you're _guaranteed_ to mess up. It has no meaning.

So what you can do is:

Really. You say that dates don't matter, but they _do_ actually matter a lot more than "remote/local" does. At least they have meaning.

The whole 'dates have meaning' issue turns out to be tricky, because revision control people have been saying for years that local time is not reliable as data. David pointed this out, saying that "using the date isn't any better. It'll give results which are about as random as just sorting by the sha1 of each parent. Yes, the ordering of the parents in a merge is probably meaningless in the general case, but so is the date." He went on, "The best we could probably do, from a theoretical standpoint, is to look at the paths via each parent to a common ancestor, and look at how many of the commits on each path were done by the same committer. Even that isn't ideal, and it's probably fairly expensive -- but it's pointless to pretend we can infer anything from _either_ the dates or the ordering of the parents in a merge." Linus replied:

Wrong. The date _does_ have meaning. It shows which of the parents was more recent, which indirectly is a hint about which side had more activity going on.

In other words, it _is_ meaningful. Maybe it's a _statistical_ meaning ("that side is probably the active one, because it has the last commit"), but it's a meaning.

David replied:

It's not entirely clear what 'active' is supposed to be useful for in this instance. You could just as well count the commits between the merge and the common ancestor, if you want to see which side was most _active_ -- but that isn't helpful for deciding the order in which 'cg-log' should show commits.

What you really want there is 'local' vs. 'remote', because people want to see the order in which changesets arrived in the _local_ repository -- if the last thing you did was pull from me, people want all my changesets to be at the top; regardless of who last committed to their tree before the merge -- i.e. regardless of whether I did a last-minute commit before you pulled, or whether you'd done another commit to your tree immediately before pulling.

As you rightly point out, the local/remote information isn't really available in an easy form -- certainly not from the ordering of the parents in a merge commit. But let's not fool ourselves that we can piece it together from the date either.

OK, the date _is_ meaningful in a way, but only in the same way that the author's name and IRC address information is meaningful. Of course we didn't include it for _nothing_, but it's outside the scope of git itself; it isn't part of the useful information which git should care about.

H. Peter Anvin asked if David had meant that he wanted "a primary search criterion which is "when did event X become visible to me", where "me" in this case is the web tool. That is not repository information, but it is perfectly possible for the webtool to be aware of what it has previously seen and when. And yes, this ordering is clearly different for each observer." Linus replied:

This is exactly what rev-tree does, and how things like the commit emails happen.

The problem is that since it's observer-dependent, it's not generally very useful for something like a web interface. You really don't want to keep track of what everybody has seen ;)

What you _can_ try to keep track of is what some "special observer" has seen. That's really quite complicated too, but if you do a web interface, the "special observer" is yourself. Then at every time you mirror the thing, you need to remember what your "last view" was, and you base your "new view" on the fact that you know what you saw last time, so you know which things are new to _you_.

But it really means that each web interface ends up showing quite _different_ information, and the particular information you show ends up being dependent on when you started looking at the tree (and how often you re-generate new views).

This really is why "time" is interesting. Because it's simple, and observers can agree about it (not because the time was the same, but because each observer just agrees that time is "whatever was reported as the local time at the point the action happened").

61. Exclude Pattern In show-files

27 Apr 2005 - 28 Apr 2005 (4 posts) Subject: "[PATCH] add a diff-files command (revised and cleaned up)"

Topics: diff-cache, diff-files, diff-tree, show-diff, show-files

People: Nicolas Pitre, Linus Torvalds

Nicolas Pitre said:

In the same spirit as diff-tree and diff-cache, here is a diff-files command that processes differences between the index cache and the working directory content. It produces lists of files that are either changed, deleted and/or unknown with regards to the current cache content. The -p option can also be used to generate a patch describing the differences.

It also has the ability to accept exclude patterns for files and the ability to read those exclude patterns from a file.

Typical usage looks like:

diff-files --others --exclude=\*.o arch/arm/ include/asm-arm/

which lists all files the git cache doesn't know about in arch/arm/ and include/asm-arm/ but ignoring any object files. Or:

diff-files --all -p --exclude-from=dontdiff.list

which produces a patch of all changes currently in the work tree while excluding all files matching any of the patterns listed in dontdiff.list (useful when one doesn't want to run 'make distclean').

Linus Torvalds replied:

I really think the current "show-diff" does that very well, and what you're doing is really different.

I think this thing is really a replacement for "show-files", which is a piece of crap (hey, I wrote it, but I don't have to be proud of it), and which really was meant to be more of what your diff-files is.

The thing is, I really don't want the "core" diff-xxx programs to worry about exclude patterns and current directory contents. They do one thing, and one thing only: compare the files they were explicitly told to compare.

HOWEVER, there clearly is a separate problem, which is what "show-files" currently does very badly (and not at all in some cases), which is the "ok, what about the _other_ files?"

And once you start talking about files that are _not_ mentioned in the index, now you really do have something totally different, and now it does need to be able to have exclude patterns to know to avoid object files and other crud that we know we're not interested in.

But for the crud we don't know about, we're not really interested in the diff against something we _do_ know about. So I think that the whole "--others" and "--all" thing is wrong (yeah, yeah, it was me that started it with show-files), and that this thing should always _only_ look at files that aren't mentioned in the index file (ie "others" is always enabled, and "all" is pointless).

Because those are special files: they are files we don't know what to do with (conversely, files that _are_ mentioned in the index but don't actually seem to show up are interesting for the exact same reason).

That set of files is interesting for several reasons:

This was all stuff that "show-files" was kind of supposed to work up to, but I just couldn't find it in myself to be interested enough.

Nicolas replied that it was precisely the "ok, what about the _other_ files" problem that his patch was intended to address. He posted a new patch, to "Give show-files the ability to process exclusion pattern." Junio C. Hamano liked this.

62. Intricacies Of Blob Parsing

27 Apr 2005 (10 posts) Subject: "[0/5] Updates for library, fsck-cache"

Topics: SHA1, fsck-cache

People: Daniel Barkalow, Linus Torvalds

Daniel Barkalow posted a patch, explaining, "We don't parse blobs at all, so any that we've got are as parsed as they're going to get. Don't make fsck-cache mark them." Linus Torvalds erupted:


This is WRONG, dammit. I fixed it once, you are re-introducing the same bug.

Daniel, the problem is that you parse them only when you SEE them, and that is totally different from having seen a REFERENCE to them. One says "I've seen this object", the other says "I _want_ to see this object". They are two totally different things.

You now mark all "blob" objects parsed regardless of whether you have actually seen the blob or not. Ie you mark a blob parsed just from having seen a _reference_ to it, and fsck can never know whether it actually really saw the object or not.

This is the commit that already fixed this bug once, and that you are now re-introducing:

        commit 4728b861ace127dc39c648f3bea64c3b86bbabc5
        tree 242227fc3c3a74d070ed36496e790335dd00c44a
        parent da6abf5d9c342a74dffbcc2015b9c27d7819a900
        author Linus Torvalds <> Sun, 24 Apr 2005 14:10:55 -0700
        committer Linus Torvalds <> Sun, 24 Apr 2005 14:10:55 -0700

            fsck-cache: notice missing "blob" objects.

            We should _not_ mark a blob object "parsed" just because we
            looked it up: it gets marked that way only once we've actually
            seen it. Otherwise we can never notice a missing blob.

please think about it.

Try to make some test-cases for fsck. They are quite easy to make: copy a good directory, and

And you'll see how this "consider a blob parsed" totally destroys fsck's ability to notice that the blob doesn't even _exist_ any more (case 3 above).

"parsing" and "looking up" are two totally independent operations. They are independent for commits and trees, and they are independent for blobs.

To mark a blob parsed, you _need_ to have actually looked it up and verified that it exists and that the object header is valid (and if you're fsck, that the sha1 matches). You MUST NOT do it in "lookup_blob()".
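
The invariant Linus insists on (looking something up only records a reference; only actually seeing the object may mark it parsed) can be sketched like this, with names that are illustrative rather than git's real C structures:

```python
class Obj:
    def __init__(self, sha1):
        self.sha1, self.parsed = sha1, False

objects = {}  # sha1 -> Obj, everything we have ever heard of

def lookup_blob(sha1):
    # MUST NOT mark parsed: we have only seen a *reference* to the object
    return objects.setdefault(sha1, Obj(sha1))

def parse_blob(sha1, object_store):
    # only mark parsed once the object has actually been read and verified
    o = lookup_blob(sha1)
    if sha1 in object_store:
        o.parsed = True
    return o

store = {"abc123"}                            # only one blob really on disk
lookup_blob("abc123"); lookup_blob("deadbf")  # two references seen in trees
parse_blob("abc123", store)                   # only this one is readable

missing = [s for s, o in objects.items() if not o.parsed]
print(missing)  # ['deadbf'] -- fsck can flag the referenced-but-absent blob
```

If lookup_blob() set parsed itself (the bug being reintroduced), the missing list would always come out empty and a deleted blob would go unnoticed.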

Daniel posted a new patch that "eliminates the special case for blobs versus other types of objects. Now the scheme is entirely regular and I won't introduce stupid bugs. (And fsck-cache doesn't have to do the do-nothing parse)"

63. New gitkdiff Commit Viewing Utility

27 Apr 2005 - 28 Apr 2005 (4 posts) Subject: "[ANNOUNCE] gitkdiff 0.1"

Topics: BitKeeper, git-viz, gitkdiff

People: Tejun Heo, Ingo Molnar

Tejun Heo said:

I've hacked tkdiff and made a commit viewing utility. Just download the following tarball and unpack it wherever PATH points to. It assumes that all base git executables are visible via PATH.

$ gitkdiff -h
/home/tj/bin/gitkdiff: illegal option -- h
GIT tkdiff - gitkdiff 0.1

Usage: gitkdiff [OPTS...] DIFFSPEC

OPTS are
    -h                      prints this help message and exit

DIFFSPEC can be one of
    [files...]              the current cache vs. working files
    -r R [files...]         files in commit R's parent vs. files in commit R
    -r R0 -r R1 [files...]  files in commit R0 vs. files in commit R1

If no file is specified, all modified files are shown.

Greg KH was very happy to see this, as was Ingo Molnar, who added, "there's only one other utility i'm missing: a tool that does the equivalent of 'bk annotate' - and to possibly integrate it with gitkdiff and git-viz. That would make 'history browsing' very powerful: to flexibly switch between changeset history graph view, annotated file view and changeset history within one utility." Tejun replied, "Actually, I am thinking about making a full gui history thing. With commit history graph, annotated file history and all that stuff (I think it will look a lot like bk revtool). Can't say how long it will take but maybe in a week. So, if you have some ideas/suggestions, please let me know."

64. Identifying And Diffing Against Tags

28 Apr 2005 (9 posts) Subject: "diff against a tag ?"

Topics: Compression, SHA1, diff-tree, fsck

People: Dave Jones, Junio C. Hamano, Linus Torvalds, Morten Welinder

Dave Jones asked, "Is there an easy way to express 'show me the diff between HEAD and 2.6.12rc3' today ? Looking at the commit for rc3, there's nothing obvious to me distinguishing it from any other commit other than the "Linux v2.6.12-rc3" in the description, which makes it somewhat difficult to automate." Junio C. Hamano replied, "with the patch below I sent today, you can say "diff-tree -p $tag $(cat .git/HEAD)"." Linus Torvalds replied:

I think Dave was wondering how to _find_ the tag in the first place, which is a different issue.

Right now fsck is the only thing that reports tags that aren't referenced some other way. Once you know the tag, things are easy - even without Junio's patch you can just do

object=$(cat-file tag $tag | sed 's/object //;q')

and then you can just do

diff-tree $object $(cat .git/HEAD)

or whatever you want to do.

Dave: do a "fsck --tags" in your tree, and it will talk about the tags it finds. Then you can create files like .git/refs/tags/v2.6.12-rc2 that contain pointers to those tags..
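The sed one-liner Linus shows simply takes the first line of the tag object and strips the "object " prefix. A rough Python equivalent, assuming (as in early git) that a tag object's first line is "object &lt;sha1&gt;", might look like this; the function name and sample text are illustrative, not part of git:

```python
# Rough equivalent (assumed) of: cat-file tag $tag | sed 's/object //;q'
# Assumes the first line of a tag object reads "object <sha1>".

def tag_object_sha1(tag_text):
    first_line = tag_text.splitlines()[0]
    return first_line.replace("object ", "", 1)

sample_tag = (
    "object da6abf5d9c342a74dffbcc2015b9c27d7819a900\n"
    "type commit\n"
    "tag v2.6.12-rc3\n"
)
print(tag_object_sha1(sample_tag))
# -> da6abf5d9c342a74dffbcc2015b9c27d7819a900
```

The resulting sha1 is what gets fed to diff-tree as shown above.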

Dave confirmed that finding the tag was what he'd really been after. He asked, "Is it 'THE LAW' that the first one it reports will always be the most recent tag?" Linus replied:

Nope. They are reported in the order they are found, which is not meaningful at all (it depends on the directory ordering, with the highest-level bits being obviously ordered by the sha1 number thanks to the subdirectory stuff).

So you can only see the name of the tag - there's no ordering. The tag-name may of course imply an ordering in itself..

Elsewhere, Morten Welinder asked Linus:

why wasn't the type of the object made part of the file name? That would have simplified scripts a good deal, and listing all tags would be trivial.

Linus replied:

it would just have complicated the code and made it less flexible.

As it is, we can open an object by just knowing its name, and by "name" I mean the true one, the SHA1. No need for "open_blob()" or anything like that. You just do

sha1_file_name(const unsigned char *sha1)

to find out what the filesystem name of the object is, and this works for _any_ object.

And thanks to this, we can pass in a sha1 _without_ knowing whether it's a tree or a commit (or a tag) and just open it. Then, when we figure out that it was a commit rather than a tree, we look up the tree instead.

Being able to open files _without_ knowing what they are is hugely useful. The user passes you a name, and you can just do the right thing.

Besides, I _still_ don't want scripts mucking around with the objects directly. Remember? They're encrypted with my super-sekrit zlib encoder ring, just because people shouldn't be mucking around in them.

Trust me, when the object directories have a million files, you'll thank me. You do _not_ want to do a readdir and try to figure out tags that way. You want to do it the way I _force_ you to do it, ie the right way.
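The "subdirectory stuff" Linus mentions is the object-store fan-out: the first two hex digits of the sha1 name a subdirectory, and the remaining digits name the file, so no single directory ever holds all the objects. A minimal sketch of that path mapping (the Python function below is an illustration, not git's actual `sha1_file_name()` in C):

```python
# Sketch of the object-store path layout: first two hex digits of the
# sha1 pick the subdirectory, the remaining 38 digits name the file.
# (Function name mirrors git's C helper; the code itself is illustrative.)

def sha1_file_name(sha1, objects_dir=".git/objects"):
    return "%s/%s/%s" % (objects_dir, sha1[:2], sha1[2:])

print(sha1_file_name("4728b861ace127dc39c648f3bea64c3b86bbabc5"))
# -> .git/objects/47/28b861ace127dc39c648f3bea64c3b86bbabc5
```

Because the mapping depends only on the sha1, the same function opens a blob, tree, commit, or tag without knowing its type in advance, which is exactly the flexibility Linus is defending.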

Sharon And Joy

Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by generous folks. All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License version 2.0.