Issues to consider for multiple project branches

haxney's picture

One part of my GSoC is to add support (of some sort) to VersionControl API for having multiple branches within a project (like a drupal.org project) with different user permissions. The idea is to allow a more DVCS-like workflow, where people can work independently on their own branch and then request an admin user to pull their changes into the master repository. I won't be starting this for a few weeks, but I wanted to get a discussion going about what it should look like.

The goal

The ultimate goal would be to allow unprivileged users, those who currently wouldn't have any sort of commit access for a project, to commit code with the same tools and environment as privileged developers. This would create a much lower barrier to entry for contributing code, and would allow new developers to more easily provide their expertise to a project.

The two main options

There are two main ways to go about a branching-based workflow:

Multiple Repositories

Allow non-admin users to create their own repository, to which only they (and possibly users they authorize) have write access, but everyone has read access. This is the way that GitHub does it, and it seems to work pretty well there. They have included within their site an interface for branching from a repository and submitting a merge request.
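To make the access model above concrete, here is a minimal sketch of what the check could look like. The `Repository` class and the `can_read`/`can_write` functions are hypothetical names made up for illustration; they are not part of VersionControl API or GitHub.

```python
from dataclasses import dataclass, field


@dataclass
class Repository:
    """Hypothetical model of a user-owned repository (not a real API)."""
    owner: str
    collaborators: set = field(default_factory=set)


def can_read(repo: Repository, user: str) -> bool:
    # In this model every user may read every repository.
    return True


def can_write(repo: Repository, user: str) -> bool:
    # Only the owner and the users the owner authorized may push.
    return user == repo.owner or user in repo.collaborators


# "alice" and "mallory" are made-up usernames.
fork = Repository(owner="chrono325", collaborators={"alice"})
```

Note that the check never looks at branch or tag names, which is where the simplicity of this option would come from.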

Per-User Branches

Allow non-admin users to create branches within the main repository, but restrict their write access to exclude certain branch or tag names. This would cut down on the number of repositories created, but would require some more complex logic in the access checking functions.
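As a rough sketch of the extra logic this would need, assume a naming convention where an unprivileged user may write only to a branch named after them, or to one with their name as a hyphenated prefix. The function name, the `admins` parameter and the convention itself are assumptions for illustration, not an existing VersionControl API interface.

```python
def can_write_branch(user: str, branch: str, admins: set) -> bool:
    """Return True if `user` may push to `branch` in the shared repository."""
    if user in admins:
        # Admins keep full access to every branch.
        return True
    # Unprivileged users may write only to "<user>" or "<user>-<suffix>".
    return branch == user or branch.startswith(user + "-")
```

Even this small version has a catch: if usernames may contain hyphens, a user named "bob" would pass the check for a branch belonging to "bob-smith", so a real implementation would need a stricter convention or an explicit branch-to-owner table.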

Advantages and disadvantages

It is not yet clear which approach is better, and the choice depends largely on the selection of VCS, since different systems make one or the other option easier or harder.

Advantages of Multiple Repositories

  • Permission-checking logic would be simpler. There would be no need to check whether a particular user had access to a particular branch; rather, users would have full access to their own repository, and the existing restrictions would apply to the main repository.

  • More freedom within a forked repository. Most DVCSs are fairly agnostic to the particular name of the branch on which development is done, but it is nice to have the freedom to create branches and tags of any desired name and push them to the remote repository without restriction. Rather than having to work on a branch called chrono325 (for example), a user could create their own master or trunk branch, and work on that.

  • More than one user branch. Related to the previous point, no extra thought would have to be devoted to how many branches a user would need and what to call them, since the upstream developer would not be affected by the unprivileged user's choice of branch name. With per-user branches, if a user wanted multiple branches on the same project, they would have to be called something like chrono325, chrono325-1, chrono325-topic and so on. With multiple repositories, there is no need to worry about that.

  • More "first-class" feel for unprivileged developers. Rather than having a restricted set of branches within the upstream repository, the unprivileged user would have full control over their own repository. This is mostly a psychological difference, but could create an atmosphere of greater inclusiveness and a more level field between admins and unprivileged users.

Disadvantages of Multiple Repositories

  • Inapplicable to centralized VCSs. CVS and Subversion do not have any means to perform cross-repository merges, so they would not be able to make use of this option at all.

  • More drastic change to project* schema. As far as I know (and I haven't yet checked), the VersionControl-project integration makes the assumption that each project has exactly one repository. Depending on how deeply-ingrained this assumption is, it may be difficult to add this support to versioncontrol_project.

  • Potentially higher disk usage. Not all VCSs have an efficient way to set up multiple repositories with mostly-similar contents. Git does, but I don't know about the others. If a VCS does not support this, then the disk space would increase linearly with the number of repositories, rather than with the number of differences between revisions. This would be a big problem for the drupal.org servers, since the disk usage could quickly balloon.

  • Need to handle repository access control for additional repositories. Many of the access controls rely on writing configuration files which are used by the VCS to handle authentication. More repositories would increase the number of entries in those files.
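The disk usage point can be illustrated with a back-of-the-envelope calculation. The sizes and fork counts below are invented for illustration only; they are not measurements of drupal.org or of any particular VCS.

```python
def disk_usage_full_copies(base_mb: float, forks: int) -> float:
    """Main repository plus forks that each store a complete copy."""
    return base_mb * (1 + forks)


def disk_usage_shared_objects(base_mb: float, forks: int, delta_mb: float) -> float:
    """Forks share the common history and store only their own changes."""
    return base_mb + forks * delta_mb


# Made-up numbers: a 50 MB project, 200 forks, 1 MB of new work per fork.
print(disk_usage_full_copies(50, 200))        # 10050
print(disk_usage_shared_objects(50, 200, 1))  # 250
```

In the first case the total grows with the number of repositories; in the second it grows with the amount of new work. Git's alternates mechanism (for example `git clone --shared`) is one way of getting the second behaviour, though whether it suits a hosted setup would need checking.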

Advantages of Per-User Branches

  • This is probably the only viable solution in centralized VCSs like Subversion, since they were not designed to operate on multiple different repositories, and do not include support (as far as I know) for merging a branch from one repository into another.

  • Cuts down on the number of repositories created. The overhead for a bare repository may not be that high, but if the VCS does not have an efficient way of sharing common objects across repositories, then having a repository for each user would quickly use up a great deal of disk space.

  • Simpler to (re)view branches with existing tools. Especially in DVCSs, there are good tools to view the differences between multiple branches within the same repository. The support may not be as good when the branches are not within the same repository. Git has "remote tracking branches" which accomplish this easily, but I don't know about other systems.

    Also, it would be readily apparent which branches are unmerged, since all of the unprivileged branches would be viewable to anyone who downloaded the repository.

  • May be easier to add to project* modules. There is already a mature system for dealing with multiple branches within a repository, but there are not yet tools for associating multiple repositories with a single project. It would therefore likely be simpler to add permission-checking code than to associate multiple repositories with a single project. I have not yet looked at the project* modules closely, so I don't know how invasive this would be.

Disadvantages of Per-User Branches

  • Namespace of main repository gets polluted. The main repository would have a larger number of branches within it, which could make it more difficult for the admin to filter the signal from the noise.

  • Increased complexity of permission checks. Permission checks would be more complicated, since the permissions for a particular branch and user would need to be checked.

  • Multiple branches per user. If a single user wanted to have multiple branches, they might have to name them something like chrono325-1, chrono325-topic and so on. This would further complicate the logic of branch permission checking. It could also present problems for users with special characters in their names, especially if the username contains characters not permitted in branch names for a particular VCS.
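The special-character problem could be handled by deriving branch names from usernames in a deliberately conservative way. The sketch below is only an illustration: `branch_name_for` is a hypothetical helper, and the allowed character set is an assumption chosen to be safe rather than the exact rule set of any one VCS (Git, for instance, rejects spaces and `~` in ref names).

```python
import re


def branch_name_for(username: str, topic: str = "") -> str:
    """Build a conservative branch name from a username and optional topic."""
    # Replace anything outside [A-Za-z0-9_] with "_". Hyphens are excluded
    # so that "-" can serve unambiguously as the user/topic separator.
    safe_user = re.sub(r"[^A-Za-z0-9_]", "_", username)
    if not safe_user:
        raise ValueError("username yields an empty branch name")
    if not topic:
        return safe_user
    safe_topic = re.sub(r"[^A-Za-z0-9_]", "_", topic)
    return f"{safe_user}-{safe_topic}"
```

With this helper, `branch_name_for("chrono325", "topic")` gives `chrono325-topic`. The mapping is lossy, though: the usernames "a b" and "a_b" end up with the same branch name, so a real implementation would also need some way to detect or avoid collisions.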

Making a decision

There doesn't appear to be a solution which is uniformly better, especially since the multiple repositories option doesn't work for centralized VCSs. Probably the best solution is to implement both and let the site administrator choose.

There is also the issue of deciding what to do for drupal.org, which depends on the choice of VCS.

Anything else?

I tried to include everything I could think of, but obviously there are more issues to consider.

Comments

Disk space abuse issues

haxney's picture

I just thought of another thing. It is a fairly unlikely scenario, but might be relevant for sites with a more anonymous user base.

Another disadvantage of Per-User Branches is that if a user maliciously or accidentally pushes a large amount of information to the repository, then any user cloning that repository would also be stuck with that info by default. It would be non-trivial (and possibly a major headache) to remove the branch with the "crap" on it without damaging the rest of the repository. VCSs rather intentionally make it difficult to fully delete information, but this means that if a user pushes a 3GB file to their unprivileged branch, it would be difficult to remove it both from the server and everyone's local copies.

With multiple repositories, if you did not want to download their work, you would simply avoid pulling from their repository. If a user's repository truly turned out to be abusive, it could simply be deleted without fear of causing other users to lose data. I can illustrate this with a more concrete example if necessary, since it is not immediately obvious what is going on in this case.

Anyway, something else to think about.

Disk space; schema changes

jpetso's picture

(Disadvantages of Multiple Repositories:) Potentially higher disk usage.

According to killes (d.o admin) and DamZ (Damien Tournoud, high-profile dev), disk space on the repository server won't be an issue. For one, it's relatively unlikely that contrib will switch to Git anyway (SVN is more likely for that), and also, to quote killes: "if we should agree that this is a good idea, then the Drupal Association will find means to get the necessary disk space". So there's little reason to worry about disk space; of all server resources, it's probably the one that can be upgraded most easily.

In general, I think per-user repositories make a lot of sense for distributed version control systems, and topic branches are very suitable for centralized ones. So yeah, we want to support both, but if necessary then we can focus on these two use cases and pretty much disregard multiple repositories for centralized VCS and single-repository topic branches for DVCS.

As Drupal core is likely to switch to a DVCS and contrib will most probably stay centralized, it's also important that a website can choose between those different methods on a per-project basis rather than with a global option.
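A per-project setting with a site-wide fallback could look roughly like the sketch below. All names here (`BranchingMode`, `SITE_DEFAULT`, `PROJECT_MODES`, `branching_mode`) and the example project keys are hypothetical; this is not how VersionControl API stores its configuration.

```python
from enum import Enum


class BranchingMode(Enum):
    PER_USER_REPOSITORIES = "per_user_repositories"
    PER_USER_BRANCHES = "per_user_branches"


# Site-wide fallback, used when a project has no explicit setting.
SITE_DEFAULT = BranchingMode.PER_USER_BRANCHES

# Per-project overrides; the project names are made up.
PROJECT_MODES = {"core_project": BranchingMode.PER_USER_REPOSITORIES}


def branching_mode(project: str) -> BranchingMode:
    """Return the mode for a project, falling back to the site default."""
    return PROJECT_MODES.get(project, SITE_DEFAULT)
```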

(Disadvantages of Multiple Repositories:) More drastic change to project* schema. As far as I know (and I haven't yet checked), the VersionControl-project integration makes the assumption that each project has exactly one repository. Depending on how deeply-ingrained this assumption is, it may be difficult to add this support to versioncontrol_project.

project.module & Co. don't deal with version control integration at all, so there's nothing to change there. The Version Control / Project Node integration module does indeed assume a single repository per project; getting rid of this assumption is the central part of your SoC task at hand. It will require quite a few changes, but it's doable. It has improved up to the point where Commit Log filters only by the project's nid and doesn't need repo/directory to determine the operation/project association. We can go from there; it's really not that bad.

Right idea, wrong approach

aidanlis's picture

What you are suggesting (a distributed version control system) sounds great, but then why not use a distributed version control system instead of trying to build this functionality on top of SVN?

I don't think anyone should work on this issue until they've actually used a DVCS like Git or Mercurial. There's a reason everyone is raving about them.

For example, Gallery3 (the leading photo gallery software) has just swapped to Git. The official maintainers have access to the vendor branch, http://github.com/gallery/gallery3/

When people want to contribute, they clone the repository (git clone git://github.com/gallery/gallery3/), make the changes they want, then push them back (git push origin master). The official team can either accept or reject the patches as they see fit.

If there's a significant change, such as adding pgsql support (the official team only supports mysql), then rather than pushing your changes back to master, you can simply maintain your own tree (http://github.com/rledisez/gallery3/network). In this case, the official team can easily pull (git pull) or cherry-pick (git cherry-pick) changes from rledisez's tree (which they do, often).

There's no reason to build a massive API which tracks which user has access to which branch. Let anyone make a branch if they want. If they are doing good things in their branch, let the official team cherry-pick the changes.
