Git scrutinized

jpetso's picture

There's a great introduction on git for SVN people like me, which made it twice as easy for me to look into how this thing works. Git only recently released version 1.5, the first one that's supposed to be usable by the masses. (It might not yet be available pre-packaged for your Linux distribution, or available at all if you're running Windows, which could be a small hurdle at the beginning.) After reading the introductory course and trying it out myself, I must say I'm hooked.

For those who didn't know, git is the distributed RCS that Linus and the other kernel folks created because they needed to get rid of BitKeeper. As the Linux kernel is a very demanding project both in code size and in patch management, git is quite capable indeed from an efficiency point of view. It's currently in use by the Linux kernel itself, Wine, and One Laptop Per Child, to name a few popular projects.

As promised in my SoC application, here's a short rundown of features that are important to this abstraction layer.


Nothing missing here, and totally straightforward: by executing 'git command', you get all those nifty features like renaming, copying, removing, branching, tagging, reverting, committing, diffing, and yes, even merging. (The latter is very cool, as it not only obsoletes 'patch -p0', but also uses 3-way diffs by utilizing the file history, in order to reduce or completely eliminate conflicts. And like 'git diff' or 'svn diff', it's completely client-side.) Commits are intuitively called "commits" (surprise!), and each one is identified by a SHA-1 hash of the commit object, which in turn covers the complete file tree, the parent commits and the commit metadata. (Mercurial also uses SHA-1 hashes, and probably a few other distributed revision control systems do as well.)
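To make the hashing a bit more concrete: git names every object it stores by the SHA-1 of a short header ("type", a space, the length in bytes, a NUL byte) followed by the object's content. Here's a minimal Python sketch of that scheme for file contents ("blobs"). The function name is made up for illustration; commit objects are hashed the same way, just with "commit" as the type.

```python
import hashlib


def git_object_id(kind: str, body: bytes) -> str:
    """Return the SHA-1 id git would give an object of this type and content.

    The helper name is hypothetical; the header layout
    "<type> <size>\\0" is what git itself prepends before hashing.
    """
    header = kind.encode("ascii") + b" " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


# The id depends only on the content, so two people hashing the same
# file on different machines end up with the same identifier.
empty_blob = git_object_id("blob", b"")
hello_blob = git_object_id("blob", b"hello world\n")
```

The second value should match what 'git hash-object' reports for a file containing just "hello world" and a newline.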

Semi-off-topic: Thinking about (and looking up) revisions a bit more, it seems that CVS and Microsoft Visual SourceSafe (which shall be deliberately disregarded due to its scary feature set) are about the only revision control systems that don't assign an identifier to the commit itself. So, as we really want to retrieve commit information just by looking up data from our {rcs_commits} table, I think it's a good idea to also incorporate the revision identifier into this table, as a nullable column so that CVS doesn't need to write data there.

Branching, tagging and file structure

Git does this quite similarly to CVS. Therefore, there's no prescribed or even recommended repository layout like with Subversion, where you put your data into trunk/, branches/ and tags/. Nothing new here, hardly worth a mention.

Authentication and hook scripts

Like SVN, git itself knows nothing of authentication but can be wrapped in ssh or https, and those manage authentication by themselves. As for controlling which branches and tags are being created, that can be achieved by server-side commit hooks just as well as the current CVS integration does it. Git offers a wide range of low-level commands which can retrieve all info you ever want to know about the current commit, and using those in the hook scripts does get you somewhere.
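As a rough idea of what such a hook could look like: git's server-side "update" hook is called once per pushed ref with three arguments (ref name, old object id, new object id), and the old id is all zeros when the ref is being created. The sketch below uses that to restrict which branches and tags may be created. The allowed patterns and function names are assumptions for illustration, not what the CVS integration actually does.

```python
import fnmatch
import sys

# git passes an all-zero old id when a ref is being created.
ZERO_ID = "0" * 40

# Hypothetical policy: only refs matching these patterns may be created.
ALLOWED_NEW_REFS = ["refs/heads/DRUPAL-*", "refs/tags/DRUPAL-*"]


def ref_update_allowed(refname, old_id, new_id, allowed=ALLOWED_NEW_REFS):
    """Decide whether a single ref update should be accepted."""
    if old_id != ZERO_ID:
        # The ref already exists; this sketch only polices creation.
        return True
    return any(fnmatch.fnmatchcase(refname, pattern) for pattern in allowed)


def main(argv):
    """Entry point in the shape git expects: update <ref> <old> <new>."""
    if len(argv) != 4:
        print("usage: update <refname> <old-id> <new-id>", file=sys.stderr)
        return 1
    refname, old_id, new_id = argv[1:4]
    if ref_update_allowed(refname, old_id, new_id):
        return 0
    print("creating %s is not allowed here" % refname, file=sys.stderr)
    return 1

# A real hook file would end with: sys.exit(main(sys.argv))
```

A non-zero exit status makes git reject the update for that ref, which is how the hook enforces the policy.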

Distributed repositories

As expected, this is the major difference to traditional, client-server-centric revision control systems. The fact that git works in a distributed way means that everyone who works on the code carries a full copy of the whole repository, including the complete version history. As git stores both data and difference information in a ridiculously efficient way (in fact, git only stores data and extracts diffs out of that data), hard disk space isn't the great problem that you would think it is. It's even less a problem than it is for Subversion, because the latter stores all of your checkout uncompressed in hidden .svn/ folders, so that it can do serverless diffs.

So space is not really an issue here, not even with a still not too huge codebase like Drupal's. However, it's kind of tedious if you always need a full checkout of the repository, and that's what you get when you get ("clone") your initial checkout from the server. Imagine how much you would have to download for all of contrib's modules, themes and translations if you just want to work on your two little pet projects.

Right, not cool. So best practice for git would mean to create a single repository for every single project, and instead of checking out the two paths where your pet projects reside, you'd get two complete repositories. This adds a new facet to our requirements, as it is now necessary to be able to manage hundreds of repositories, instead of just a few. Maybe even with automatic creation together with new projects. And of course, every single repository comes with its own set of hook scripts - granted, those could be symlinked if you want to centralize them.
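The symlinking part could be scripted along these lines. This is only a sketch under assumptions: the function name and directory layout are made up, and it relies on the fact that a bare repository keeps its hook scripts in a hooks/ directory directly inside reponame.git.

```python
from pathlib import Path


def link_shared_hooks(repo_dir: Path, shared_hooks_dir: Path) -> list:
    """Symlink every script in shared_hooks_dir into repo_dir/hooks.

    repo_dir is assumed to be a bare repository directory such as
    /srv/git/mymodule.git (hypothetical path). Returns the linked names.
    """
    hooks_dir = repo_dir / "hooks"
    hooks_dir.mkdir(parents=True, exist_ok=True)
    linked = []
    for script in sorted(shared_hooks_dir.iterdir()):
        if not script.is_file():
            continue
        target = hooks_dir / script.name
        # Replace a stale link or an old copy of the hook.
        if target.is_symlink() or target.exists():
            target.unlink()
        target.symlink_to(script.resolve())
        linked.append(script.name)
    return linked
```

Running this once per repository (for example right after a new project's repository is created) keeps all hooks in one central place, so a policy change only has to be made there.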

(Mind that is not going for git anytime soon, I'm just using it as a good use case.)

I also found out that the ominous reponame.git is not a file, but just a directory containing the current repository; the same thing that normally resides in the root of your working copy as hidden repo/.git/ directory. This is called a bare repository, and meant for public repositories on some server like the one that provides. (Remember, repositories are otherwise meant to be carried around together with your own working copy.)

The coolest feature that comes with distributed repositories is certainly the flexible patch management that you get by committing to your own local repository and only pushing to the public one later. Or maybe by fetching sources from someone other than the original server. In any case, those are arguably client-side features and don't affect this project, which is clearly aimed at the server.

In other news

I created two of the projects that I'm going to need for my RCS modules, so that the database schema a.k.a. .install files are under revision control. Say hello to Revision Control API and Project RCS integration. Missing the actual module, but featuring .install, .info and README.txt files. I had planned to create a new "CVS backend" (short name: "cvs") project as well, but then it dawned on me why "cvslog" is not simply named "cvs" in the first place. Tricky stuff, I tell you ;-)

Proposals for the backend modules' naming scheme? How about "xxx_backend"? (Clean, but slightly ugly, as this makes for a hook implementation which bears the resounding name cvs_backend_rcs_backends().) I can't currently think of any better solution.

I'm also playing with the idea of splitting out another module that would provide an interface (only optionally implemented by the backend) to "RCS Access Control", short name: "rcs_access". It would also provide those motivation forms. However, I need to let this rest for some time and decide later if it's necessary; it can always be split out from the CVS backend later on.

Upcoming work

Seems there are only two questions left since my previous post. And one of them is really marginal. Recap:

  • Need a good solution on how to handle branches and tags. {cvs_tags} is not the ideal solution, given that branch != tag in most RCS. Neither can be attached to the commit itself, because it can potentially span multiple branches, including HEAD/trunk/whatever. (Which is not clean, but possible.)
  • Is any extra care needed for directories? I think we can just treat them as files, like is standard on UNIX, but it's certainly possible that I'm missing something.

Before I head off to drafting the RCS API, I still want to take a look at Mercurial which seems to be a promising RCS as well. Maybe a short glance into Bazaar, too. And the need to answer the above questions is still there.


great write-up

dww's picture

so far, looks good. i can't comment on your naming convention questions yet, since it's too late and i'd need to think about it more. but, i'm pleased that you're doing this research and i think it'll definitely help improve the RCS API. nice work.

Appendix: tags and branches, directory handling

jpetso's picture

I figured there is indeed something more to say about tags and branches, namely their scope. Each branch applies to the whole repository instead of just a subdirectory, and for tags this is the standard too, even if you can also tag single files.

Also note that git only tracks file revisions, but has no notion of directories other than the fact that they contain revision controlled files. That's also the reason that you cannot tag whole directories, only the files that are in there.

What might be clear, but should maybe also be said, is that CVS' sick "sticky tags" (binding the local checkout to a specific remote branch and losing this information if an update from any other branch is fetched) are not used in any other RCS that I know of, and that's a good thing, because branches always ought to be "sticky". What you can do in git is to bind a branch (not the whole checkout) to a specific remote branch which is then used as source for 'git pull', and that's quite sane.

SVN with its "each branch is only a directory" approach doesn't know any of those, as "binding" branches occurs implicitly by having a separate directory for each.

Side note: the "Post new comment" link was not working for me, so I used the "reply" one.

Oh damn

jpetso's picture

Side note: the "Post new comment" link was not working for me, so I used the "reply" one.

I am sooo stupid.

Still more on branches

jpetso's picture

Trying to figure out how to represent branches in the API, I had another look at branching in git. Maybe the claim that git does branching and tagging similar to CVS is a bit of an overstatement, as there are a couple of important differences.

Important difference #1: Branches are not made for eternity. Like in every advanced RCS (that means, not in CVS), branches can be deleted when their purpose has been fulfilled. You don't normally do this with stable branches like DRUPAL-5, but for working branches this practice is pretty widespread. In git, you can even rename branches.

These capabilities have the consequence that a branch might not exist anymore when we retrieve details about a revision that was committed some time ago. Also, git does not track the history of branches and tags, it just tracks commits. So when a branch has been deleted or renamed, you can't reliably retrieve the branch that it was committed to. (That is, if you don't log it in the database.)

Important difference #2: A commit can be contained by multiple branches. With git, you can merge branches while retaining history, so you can for example merge your working branch "newfeature" into DRUPAL-5 and HEAD (which would be named "drupal-5" and "master" in git, but hey). You'd probably even delete your "newfeature" branch afterwards.

So if you later want to retrieve the branch that this commit belongs to, you can't have it, because you'll get the branches that this commit belongs to. Which means that any branch info in our data structures must be an array, not a single value.

It seems that because of their volatility, it could be a good idea to store branches and tags in our own tables and not rely on the RCS itself, at least not completely.
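A minimal sketch of what that could mean for our data structures, assuming hypothetical class and field names (nothing here is the final API): each commit record carries a list of branch names, and we log those names ourselves so they survive a later rename or deletion in the RCS.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class CommitRecord:
    commit_id: int                  # our own integer id
    revision: Optional[str]         # native identifier, None if the RCS has none
    branches: List[str] = field(default_factory=list)  # an array, not one value


class BranchLog:
    """In-memory stand-in for our own branch bookkeeping tables."""

    def __init__(self) -> None:
        self._commits: Dict[int, CommitRecord] = {}

    def record(self, commit_id: int, revision: Optional[str],
               branches: List[str]) -> CommitRecord:
        rec = CommitRecord(commit_id, revision, list(branches))
        self._commits[commit_id] = rec
        return rec

    def add_branch(self, commit_id: int, branch: str) -> None:
        """Note that an existing commit is now also contained in `branch`."""
        rec = self._commits[commit_id]
        if branch not in rec.branches:
            rec.branches.append(branch)

    def branches_of(self, commit_id: int) -> List[str]:
        return list(self._commits[commit_id].branches)


# The scenario from above: committed on "newfeature", later merged into
# "drupal-5" and "master". The revision string is just a placeholder.
log = BranchLog()
log.record(1, "3b18e512dba79e4c8300dd08aeb37f8e728b8dad", ["newfeature"])
log.add_branch(1, "drupal-5")
log.add_branch(1, "master")
```

Even if "newfeature" is deleted in git afterwards, the log still knows that the commit was originally made there.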


dww's picture

re: #1: please run cvs -H tag from your shell. ;) CVS can rename and delete tags, too. the d.o CVS repository just outlaws the practice in most cases.

re: #2: CVS has the same problem for different reasons. and it's equivalent to the problem that a given tag can be in many branches, too. that's why the xcvs scripts can not force people to add their DRUPAL-5--2-3 tag to the DRUPAL-5--2 branch. here's an example:

  • README.txt checked into HEAD, revision 1.1
  • README.txt modified, revision 1.2
  • DRUPAL-5 branch created (README.txt still at rev 1.2)
  • other code modified on DRUPAL-5 branch
  • DRUPAL-5--1-0 tag added to a workspace checked out from DRUPAL-5 (README.txt still at rev 1.2)
  • code in HEAD ported to D6 API
  • DRUPAL-6--1-0-BETA tag added to HEAD (README.txt still at rev 1.2)
  • DRUPAL-6 branch created (README.txt still at rev 1.2)
  • code modified in DRUPAL-6
  • DRUPAL-6--1-0 tag added to a workspace checked out from DRUPAL-6 (README.txt still at rev 1.2)

so, README.txt revision 1.2 is in a bunch of tags and branches, but it was a single commit. so, there's no way to say "this commit belongs to that branch".

(note: i had a typo in here originally. the not above was left out accidentally).

See Dries's blog about git.

Unique id for a commit, even in CVS

dww's picture

It's true that CVS doesn't provide a unique ID for each commit natively. However, the cvs.module does this, as the commit id in the DB tables. We need such an ID to be able to JOIN various things, and this is what allows unique URLs for specific commits, e.g.

So, just because CVS doesn't provide such an ID doesn't mean we can't have one in the API. ;) For backends that don't natively provide these, we just use our own autoincrement DB column, instead.

The big question is if every VCS that does this already does it as a simple integer, or a complex string. It'd certainly be nice from the API and schema perspective if this id was always an int.

Representing commit identifiers

jpetso's picture

It's true that CVS doesn't provide a unique ID for each commit natively. However, the cvs.module does this, as the commit id in the DB tables. We need such an ID to be able to JOIN various things, and this is what allows unique URLs for specific commits, e.g.

I know that. The plan is to have an integer commit_id for all backends, regardless of their internal revision identifier representation. They can, however, add the native identifier as 'revision' column to the {rcs_commits} table, optionally. For backends that don't natively provide repository-wide revision identifiers (CVS), the 'revision' value will simply be NULL. The API will provide ways to access commits by either internal commit id or native revision string.
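Roughly, that plan could look like the following sketch. It uses SQLite through Python only to keep the example self-contained; apart from commit_id and revision, the column names and helper functions are made up for illustration, and the real table would be defined in the module's .install file.

```python
import sqlite3

SCHEMA = """
CREATE TABLE rcs_commits (
    commit_id INTEGER PRIMARY KEY AUTOINCREMENT,  -- our own id, all backends
    backend   TEXT NOT NULL,                      -- hypothetical column
    revision  TEXT,                               -- native id, NULL for CVS
    message   TEXT NOT NULL                       -- hypothetical column
);
"""


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row
    conn.executescript(SCHEMA)
    return conn


def add_commit(conn, backend, revision, message) -> int:
    """Insert a commit and return the integer commit_id assigned to it."""
    cur = conn.execute(
        "INSERT INTO rcs_commits (backend, revision, message) VALUES (?, ?, ?)",
        (backend, revision, message),
    )
    return cur.lastrowid


def get_by_id(conn, commit_id):
    row = conn.execute(
        "SELECT * FROM rcs_commits WHERE commit_id = ?", (commit_id,)
    ).fetchone()
    return dict(row) if row else None


def get_by_revision(conn, backend, revision):
    """Look a commit up by its native revision string (opaque to us)."""
    row = conn.execute(
        "SELECT * FROM rcs_commits WHERE backend = ? AND revision = ?",
        (backend, revision),
    ).fetchone()
    return dict(row) if row else None
```

The revision is stored as plain text on purpose, so that "42" from Subversion and a 40-character hash from git fit into the same column without the API interpreting either of them.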

Having followed the kdevelop-devel mailing list, where the creation of a VCS front-end interface was discussed, I'm reasonably confident that we must not assume any fixed or even sequential style of the global revision identifier. KDevelop's original draft of the interface included various different formats for the identifier, but it became clear pretty quickly that it doesn't make sense to let any code other than the backend's interpret the revision identifier. Subversion with its consecutive revision numbers is very handy here, but this is just a special case which does not extend to the majority of other version control systems. To be generally applicable, it just has to be a plain, dumb string.

Re: Unique id for a commit, even in CVS

jnareb's picture

The big question is if every VCS that does this already does it as a simple integer, or a complex string. It'd certainly be nice from the API and schema perspective if this id was always an int.

To have the commit / revision id always be an integer (like in Subversion) you need a centralized server which assigns numbers to revisions. In distributed development some revisions may appear earlier in some repositories (some clones of a repository), while others appear later. Using the SHA-1 of a commit has the advantage that it can be computed independently.

Well, you can represent SHA-1 as a fixed-width 40-character hexadecimal string, or as a 160-bit / 20-byte integer ;-)
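For illustration, the two representations convert into each other like this (a small Python sketch; the helper names are made up). A 40-digit hex string encodes 160 bits, which fit into 20 bytes.

```python
def sha1_hex_to_int(hex_id: str) -> int:
    """Interpret a 40-character hexadecimal SHA-1 as one big integer."""
    if len(hex_id) != 40:
        raise ValueError("expected 40 hex characters, got %d" % len(hex_id))
    return int(hex_id, 16)


def sha1_int_to_hex(value: int) -> str:
    """Format the integer back as a zero-padded 40-character hex string."""
    return format(value, "040x")


def sha1_int_to_bytes(value: int) -> bytes:
    """Pack the integer into its 20-byte (160-bit) big-endian form."""
    return value.to_bytes(20, "big")
```

Such an integer is far too wide for an ordinary database int column, though, so in practice it doesn't buy much over storing the string.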

Jakub Narebski


One of the biggest things

gordon's picture

One of the biggest things that is overlooked in git is the attribution of changes.

ATM when we submit patches to core and they are committed, the author as far as CVS is concerned is the committer. So if Dries commits a patch to core, he is recorded as the author, not you, the real author of the patch.

Whereas with git the real author is listed as the author and Dries is listed as the committer. This all comes out of the SCO lawsuit, so that every piece of code can be attributed to the person who actually wrote it.
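To see where this comes from: a git commit object has separate "author" and "committer" header lines, each with a name, an e-mail address and a timestamp. The Python sketch below pulls both out of the raw text of a commit (as printed by 'git cat-file commit'). The sample commit, its names and addresses are invented for illustration.

```python
import re

# Header form: "<role> <name> <email-in-angle-brackets> <unix-time> <timezone>"
IDENT = re.compile(r"^(author|committer) (.+) <([^>]*)> (\d+) ([+-]\d{4})$")


def parse_identities(raw_commit: str) -> dict:
    """Return {'author': {...}, 'committer': {...}} from raw commit text."""
    result = {}
    for line in raw_commit.split("\n"):
        if line == "":
            break  # a blank line separates the headers from the message
        match = IDENT.match(line)
        if match:
            role, name, email, timestamp, tz = match.groups()
            result[role] = {"name": name, "email": email,
                            "time": int(timestamp), "tz": tz}
    return result


# Invented example: the patch author and the person who committed it differ.
SAMPLE = (
    "tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\n"
    "author Patch Author <author@example.org> 1180000000 +0200\n"
    "committer Core Committer <committer@example.org> 1180003600 +0200\n"
    "\n"
    "Fix a bug in the form API\n"
)
identities = parse_identities(SAMPLE)
```

An RCS API that wants to credit patch authors would need room for both fields, with the two simply being equal for backends like CVS that only know the committer.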

Commits can also be signed by other people. So if I were to make a major change to the formapi, chx could sign that it has been checked by him, and that would give it more weight in getting it through.

One of the misconceptions is that we could have more people with core access, but in reality we would need fewer. The maintainers of the subsystems would have their own repositories where they would commit all the changes that they have destined for core. Other people can pull these changes and merge them into their own repositories if they need to, which results in code being fleshed out better. Then the core maintainers can pull the changes from the subsystem maintainers' repositories into core when they are ready, keeping all the commit history and changes completely intact.

Gordon Heydon
