Version Control API iterations

Posted by jpetso on July 14, 2007 at 9:20am

Well, I promised you an update on the API module, so that's what you'll get. Just a short notice beforehand: the CVS backend's database structure is now online and available for everyone's reviewing pleasure. Module functions will be added in a week or two, but before that, I'll get the xcvs-* scripts to fill in the entries for the new database structure. I've also made a (slightly different, but largely similar) SVN version of the .install file; you can find that one attached to this post. (As a Subversion backend module is only an optional deliverable, I'd rather not start a versioncontrol_svn project and its directory in CVS just yet.) So it seems to me that deliverable number three, a finished database schema for the CVS backend, can be marked as done.

Get to the point, dude!

Ok, ok. My remaining problem with the Version Control API was that it lacked a way to consistently deal with tags and branches, and to uniquely identify files and directories even if they are not versioned themselves (which is the case for directories in CVS and other version control systems). So I thought a bit more about how to deal with this, and the outcome is a new building block object for items, in addition to the previously existing ones for repositories and commits.

In the context of the Version Control API, items are file-level contents of a repository, that is, files and directories, whether they are versioned or not. The arrival of item arrays simplifies the API quite a bit, as items are now consistently used in function parameters and return values. You call some function and hand its items over to the next one as you need them, and that's a definite improvement over the mostly correct but differently shaped return values that were in there before the rework.

The art of identification

Together with item objects comes the question of what is needed in order to uniquely identify such an item. I had quite a hard time thinking about this, and about how it would be possible to wrap it into the API in a nice way. The difficulty here is that there are a few special cases: items that are not versioned, items for which a repository-wide revision is given that doesn't match any of the file-level ones, and items that you are retrieving from a certain branch or tag which you want to keep for the next function call.

What, then, is necessary to uniquely identify the simplest item object, one that doesn't fall under any of these special cases? Here's what I came up with:

The repository.
The path of the item inside the repository.
And the file-level revision of the item, for example '1.2.2.7' in CVS or '127' in Subversion.

That's it. So an item array contains path, file-level revision, and item type that specifies if it's a file or a directory. The idea of the refactored API is to only pass around valid items that transparently encapsulate additional hidden info, like a tag or a global revision, without requiring the API user to know about all this stuff. For example, versioncontrol_get_item_branches() gets you a set of items in different branches, and passing one of these items (which is, say, a directory in the DRUPAL-5 branch) to versioncontrol_get_directory_contents() would automatically get you another set of child items from the same branch.
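As a rough illustration of that idea (not the module's actual code, which is PHP), here is a small Python sketch. The `make_item` helper, the `_backend` key, the toy repository and the revision numbers in it are all made up for the example; only the path/revision/type fields and the branch-carrying behaviour come from the description above.

```python
# Hypothetical sketch: names and fields are illustrative, not the real API.
FILE, DIRECTORY = "file", "directory"


def make_item(path, revision, item_type, **backend_info):
    """Build an item: the public fields plus opaque backend-specific extras."""
    return {
        "path": path,
        "revision": revision,  # None for unversioned items (CVS directories)
        "type": item_type,
        "_backend": dict(backend_info),  # hidden info such as a branch or tag
    }


# A toy "repository": branch -> directory path -> (name, revision, type).
# The revision numbers are invented for the example.
TOY_REPOSITORY = {
    "DRUPAL-5": {
        "/modules": [
            ("node.module", "1.776.2.3", FILE),
            ("system", None, DIRECTORY),
        ],
    },
    "HEAD": {
        "/modules": [("node.module", "1.812", FILE)],
    },
}


def get_directory_contents(repository, directory_item):
    """Return child items, carrying over the parent's hidden branch info."""
    branch = directory_item["_backend"].get("branch", "HEAD")
    entries = repository.get(branch, {}).get(directory_item["path"], [])
    return [
        make_item(directory_item["path"] + "/" + name, revision, item_type,
                  branch=branch)
        for name, revision, item_type in entries
    ]


# The caller never touches the branch; it just hands the item along.
parent = make_item("/modules", None, DIRECTORY, branch="DRUPAL-5")
children = get_directory_contents(TOY_REPOSITORY, parent)
```

The point of the design is visible in the last two lines: the caller only passes an item on, and the children still come from the DRUPAL-5 branch.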

The same thing applies for view/diff URLs and the rest of the API. Which additional info is stored in an item object depends on the backend, so Subversion won't need to keep branches and tags in an item as there is no such thing in Subversion. (Internally, at least - it will nevertheless be possible to provide full tag/branch support by analyzing and transforming the item path.)
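To make the "analyzing and transforming the item path" part a bit more concrete, here is a minimal Python sketch. It assumes the conventional top-level trunk/branches/tags layout, which a repository is free not to follow, and `split_svn_path` is a hypothetical helper name rather than anything in the module.

```python
def split_svn_path(path):
    """Derive branch/tag info from a path, assuming a conventional layout.

    Assumption: the repository root contains 'trunk', 'branches/<name>'
    and 'tags/<name>'. Anything else is returned without branch/tag info.
    """
    parts = [part for part in path.split("/") if part]
    if parts and parts[0] == "trunk":
        return {"kind": "trunk", "name": None,
                "path": "/" + "/".join(parts[1:])}
    if len(parts) >= 2 and parts[0] in ("branches", "tags"):
        kind = "branch" if parts[0] == "branches" else "tag"
        return {"kind": kind, "name": parts[1],
                "path": "/" + "/".join(parts[2:])}
    return {"kind": None, "name": None, "path": "/" + "/".join(parts)}


info = split_svn_path("/branches/DRUPAL-5/modules/node.module")
```

With this layout, `info` describes a branch named DRUPAL-5 and the branch-relative path `/modules/node.module`, so the backend can offer branch support without storing a branch in the item itself.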

Upcoming work

I might just get the opportunity to continue my efforts so it seems appropriate to plan some more. My half-term survey for Google is filled in (and so is Andy's), the item stuff is figured out, and it's time to get my code to actually do some work now.

Supporting multiple branches in one project, especially if they can be maintained by different people than the actual project maintainer (yes, that's common in distributed development methodologies) is harder than I originally assumed. But as this would happen in the project node integration, it's also at the very top of the infrastructure hierarchy, so it can still be done afterwards without affecting the more low-level code. I talked to Andy yesterday, and he reminded me to stay down to earth and concentrate on getting the basic features to work, only afterwards worrying about optional extras.

And that's what I'll do. Luckily, that coincides with my schedule anyways, because that one says

Modify the xcvs-* scripts so that they store data in the newly defined database schema.
Targeted for: 2007/07/19

Consequently, the plan is to

Fill the database with commit info!
Implement what's necessary to retrieve this commit info from the database!
and: Display a commit log with all the info that has been retrieved!

Sounds like a good plan, doesn't it? When that stuff works, I'll get back to the more advanced stuff like account management, project node integration, motivation text approval, Subversion backend, Drupal-based repository viewer, world domination, ...yah, you get the idea. Make the functionality work that is used by and needed for drupal.org as it exists now, and afterwards add a whole lot of nice features that you did not ever think about in your nasty CVS restriction cage. And have a solid base for all of that.

Right now, I love doing this. You may have noticed.

Update: Inspired by aclight's comment, there's a new version of both versioncontrol_svn.install (attached, now as "v2") and versioncontrol_cvs.install (in CVS) that makes it possible to get the data for versioncontrol_get_commit_actions() only by querying {versioncontrol_[cvs,svn]_item_revisions}. Read the comments section for further information.

Attachment	Size
versioncontrol_svn.install.v2.txt	2.35 KB

Comments

versioncontrol_svn.install question

Posted by aclight on July 14, 2007 at 12:48pm

It looks like this is coming along very well.

One quick question:

In your versioncontrol_svn.install file, you have:

      db_query("CREATE TABLE {versioncontrol_svn_file_revisions} (
        commit_id int unsigned NOT NULL default '0',
        filepath varchar(255) NOT NULL default '',
        revision varchar(32) NOT NULL default '0',
        action varchar(64) NOT NULL default '',
        PRIMARY KEY (commit_id, filepath)

I realize you say that this table might not be necessary if svnlook is fast enough, but I think it's probably a good idea to keep this table either way. But why do you need both commit_id and revision? Shouldn't you be able to just make revision an integer and make that the primary key for the table?

Sorry if you explained this already somewhere else--my excuse is that I fell asleep during class :)

Thanks
AC

Rationale

Posted by jpetso on July 15, 2007 at 9:13am

"Sorry if you explained this already somewhere else--my excuse is that I fell asleep during class :)"

Aww, are my updates so tiring? Sorry for that x-)

"But why do you need both commit_id and revision? Shouldn't you be able to just make revision an integer and make that the primary key for the table?"

First of all, I can't make revision the sole primary key as there may be multiple items per revision. That means that the primary key must either be (commit_id, path) or (revision, path). The reason for revision being a string is a combination of the following thoughts:

The item_revisions table is designed to be joined with {versioncontrol_commits}, so there needs to be a common element, preferably an indexed integer. The revision column in the commits table is a string, so either we lose joinability by making {item_revisions}->revision an integer, or we unnecessarily lose performance (unindexed string joins) and the opportunity to find item revisions by commit id. I'd rather keep both and store a little more data instead.
The revision is only in the table for performance reasons; it could very well be retrieved by just joining with the commit table, with commit_id as the common join element. On that thought, I asked myself if such joins need to occur at all for the common cases, and the answer is no.

With a few small changes (thanks for making me think about it), it's possible to retrieve all info that is required for assembling the result value of any get_commit_actions() and get_item_history() call, without the need to join with the commits table at all - for the SVN backend, at least. In particular, we need item type, path and revision, plus the unique identifier of the previous version of that item (as in SVN there is only one predecessor for any item, even for merges). That led me to introduce a unique item id (integer) which identifies an item throughout all its renamings.

Which means that the primary key is now (commit_id, item_id) instead of (commit_id, path) in the SVN backend. CVS doesn't have renames, thus it can still have the (commit_id, path) as primary key and we still get all the required data.
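Here is a minimal sketch of how that could look, using Python's sqlite3 instead of Drupal's database layer. The table and column names are simplified guesses for illustration (the real definitions are in the attached .install file), and the rows and action names are invented. It shows both lookups being answered from the item revisions table alone, with item_id following an item across a rename.

```python
import sqlite3

connection = sqlite3.connect(":memory:")

# Simplified, hypothetical version of an SVN item revisions table.
connection.execute("""
    CREATE TABLE item_revisions (
        commit_id INTEGER NOT NULL,
        item_id   INTEGER NOT NULL,  -- stays the same across renames
        type      TEXT NOT NULL,
        path      TEXT NOT NULL,
        revision  TEXT NOT NULL,     -- kept as a string, as in the post
        action    TEXT NOT NULL,
        PRIMARY KEY (commit_id, item_id)
    )
""")

# Invented sample data: item 10 is added, renamed, then modified.
connection.executemany(
    "INSERT INTO item_revisions VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 10, "file", "/trunk/a.txt", "1", "added"),
        (2, 10, "file", "/trunk/b.txt", "2", "moved"),
        (2, 11, "file", "/trunk/c.txt", "2", "added"),
        (3, 10, "file", "/trunk/b.txt", "3", "modified"),
    ],
)


def get_commit_actions(commit_id):
    """All items touched by one commit, without joining the commits table."""
    return connection.execute(
        "SELECT path, revision, action FROM item_revisions"
        " WHERE commit_id = ? ORDER BY item_id",
        (commit_id,),
    ).fetchall()


def get_item_history(item_id):
    """Newest-first history of one item, followed through its renames."""
    return connection.execute(
        "SELECT path, revision FROM item_revisions"
        " WHERE item_id = ? ORDER BY commit_id DESC",
        (item_id,),
    ).fetchall()


actions = get_commit_actions(2)
history = get_item_history(10)
```

Because the history is looked up by item_id rather than by path, the rename from a.txt to b.txt does not break the chain.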

So, the item_revisions tables in the SVN and CVS backends are a bit more verbose now, but it seems to me like a good tradeoff for better performance. Especially as at least the SVN item_revisions table is only there for caching anyways.

Thanks for your comment - revision is still not an int, but the rest is clearly better now :D

Fantastic!

Posted by mlncn on July 15, 2007 at 12:01pm

Hope you'll be able to make Subversion's backend!

Any Drupal shops looked at using both Case Tracker extended and Project manager with say a subversion backend for managing client projects with custom code? Ideas on integration?

~ ben melançon

member, Agaric Design Collective
http://AgaricDesign.com - "Open Source Web Development"

^{benjamin, agaric}

Case Tracker + Subversion

Posted by DaveNotik on July 16, 2007 at 10:02pm

Those are in fact two essential ingredients to what I'm pushing for.

I initiated the Case Tracker module as the first and most crucial piece in a distributed team collaboration tool. As software development is the most obvious first market (software development is often very distributed) adding revision control (Subversion) is the next piece. Each project workspace comes with case tracking and revision control -- already a powerful tool.

I'd like to work with others who are interested in pursuing that goal.

http://www.wovenlabs.com/projects

--D

--
http://www.wovenlabs.com

--
http://www.woven.org
http://www.davidnotik.com

Case Tracker integration

Posted by jpetso on July 17, 2007 at 8:39am

In fact, my employer uses both Case Tracker and Subversion for collaboration, so I'm indeed aware of the opportunity to integrate Case Tracker with version control systems :-] If it all goes well, project node integration will work for Case Tracker in a similar way to how it works for project*. That's the plan, at least. Then we only need a way to integrate commit messages with case comments (close or update the bug with a special statement in the commit message) and Case Tracker is totally VCS enabled.

Who needs Trac when we've got Drupal? ;-)

Split effort

Posted by dww on July 23, 2007 at 4:38am

"Who needs Trac when we've got Drupal? ;-)"

Good question. However, the goal of a Drupal-ified replacement for Trac is going to be harder to reach if we're splitting our resources into multiple efforts. project* + versioncontrol* is already almost a replacement for Trac. But, it still needs more work and love. CaseTracker + versioncontrol* is even further from being a replacement for Trac. I just wish that instead of both "teams" working towards their own Trac replacement, we could pool our efforts and focus on 1 solution and actually make it kick ass. Since project* is what we use at d.o, I personally believe the entire Drupal community would benefit from making our One-True-Trac-Replacement effort based on project*.

I've made this argument before, and the usual chorus line goes "...but choices are good, and there's no harm in having more than one way to do it". And, of course, CaseTracker already exists and people use it, so there's no changing that. Furthermore, it would be stupid to write versioncontrol* in such a way that made CaseTracker integration harder, so I certainly support jpetso's efforts to make this all general enough to be useful to everyone.

I guess when CaseTracker was first being written, I hadn't started my crusade to make project* useful for other sites. Project_issue was in sad shape (it was still part of project at that time, in fact), so I know why CaseTracker was launched in the first place. But, a lot has changed in project_issue and project* since then, and lots of other changes are on the way. I think we could make project_issue just as light-weight and flexible as CT is in the near future.

I just really wish all the effort that went into CaseTracker was put into improving project_issue.module and making it slick, customizable, and flexible, instead. I don't know how to make that happen, now that the forked effort exists, but I wanted to mention my thoughts on it again.

Cheers,
-Derek

xcvs-* scripts

Posted by dww on July 23, 2007 at 4:24am

Before you do anything else with the xcvs-* scripts, please review and test http://drupal.org/node/136866. These scripts should just bootstrap Drupal so they can use the DB abstraction layer and directly call these various APIs, instead of manually manipulating the DB. I'd love to get that patch RTBC and committed in the near future. It's a requirement for PgSQL support for the cvs.module anyway, and it'd probably make life easier and cleaner all the way around. So, before you fork the xcvs-* scripts, let's get this other task done, first.

Thanks!
-Derek

Er...

Posted by jpetso on July 23, 2007 at 8:29am

I'm sorry... I already forked the loginfo and config files last week, including the bootstrap code. Perhaps it would nevertheless be a good thing to help with the original scripts. I'll review aclight's patch today or tomorrow.