Diving into SoC

Posted by jpetso on May 29, 2007 at 6:26pm

So that's that. After having spent the past two months improving filefield and imagefield, it's about time that I get started with my RCS abstraction project. aclight has been kind enough to publish his work and findings in time for the Summer of Code, so that's a good starting point already. His work on the xcvs scripts (...bootstrapping Drupal, yay) benefits my project a lot, and the issues with password storage are good to know about.

In order to come up with a set of good questions that I'll need to research, I took a first shot at the database schema and the overall idea of how this thing should work. I thought it would be a good idea to see what we want to achieve and what we can rely on, before figuring out how to abstract all the stuff that's supposed to be different among the various revision control systems. Bullet points galore.

Goals for the RCS API

Make project* independent from CVS, by creating an intermediate API.
Make the intermediate API independent from project*, so that other modules (right, Case Tracker) can use the RCS API as well. Like, if I'm abstracting everything anyway, why not go all the way.
Share as much code and data as possible, but not more.
We only want to support revision control systems that make it possible to do branching and tagging in some way.

Given these constraints, what I want to do is to create two projects:

"Revision Control" (short name: "rcs") for the general API, the HTML/RSS commit listings and the repository settings. Reasoning for the module name: API is left out of the name because it's no pure API module, and "Revision Control" makes for a better module name than the more cryptic abbreviation. And secondly, ...
"Project RCS integration" (short name: "project_rcs") for managing project/maintainer/repository relations.

To my mentors: is this breakdown of the modules ok for you? If so, I could just go on and create the projects this way.

Stuff that we can rely on

Each RCS that we want to support uses revision identifiers for tagging files/directories and/or commits.
Each RCS that we want to support uses commits which can contain one or more file/directory operations (modification, copy, move, delete).
Each RCS that we want to support can tell us every little detail about a repository's revision history, given the repository itself and a since-when identifier (which might be a date or a revision identifier).
Each change to a file is contained in a commit, and filename + revision identifier is a unique identifier for a file. Which also means that given a filename and one or two revision identifiers, we can always build a diff or file-view url, so those can be kept in the central database storage.
Each file has a predecessor, except if it has been newly created.
Each project is only part of one repository. If a project should be accessible by multiple revision control systems, the admin is expected to use an RCS adapter instead, like for example git-svn.
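The point about building diff and file-view URLs from a filename plus one or two revision identifiers could look roughly like the sketch below. The function name, the placeholder names (%file, %rev, %old) and the example.com URLs are all assumptions made up for illustration, not something taken from cvs.module.

```php
<?php
// Hypothetical helper: expands a per-repository URL template.
// The placeholder names %file, %rev and %old are assumptions for this sketch.
function rcs_example_build_url($template, $filename, $rev, $old_rev = NULL) {
  return strtr($template, array(
    '%file' => $filename,
    '%rev' => $rev,
    '%old' => $old_rev === NULL ? '' : $old_rev,
  ));
}

// Templates as they might be stored once per repository (made-up URLs).
$diff_template = 'http://example.com/viewvc/%file?r1=%old&r2=%rev';
$view_template = 'http://example.com/viewvc/%file?revision=%rev';

// A diff needs two revision identifiers, a file view only one.
echo rcs_example_build_url($diff_template, 'modules/foo/foo.module', '1.4', '1.3'), "\n";
echo rcs_example_build_url($view_template, 'modules/foo/foo.module', '1.4'), "\n";
```

Because only the templates are stored, nothing here depends on which backend produced the revision identifiers.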

Stuff that we cannot rely on

We can't expect commits to be identified by a single revision identifier, because CVS doesn't do this.
Files can't be uniquely identified by their file names, because they can be renamed, or a file can be copied to the previous location of a deleted file.
Authentication and user account storage can be vastly different.
We can make no assumptions on how branches and tags are stored. In CVS, one single file can exist in multiple branches and tags, while in SVN, multiple branches and tags are copies of a file or directory, lying around in completely different directories and only connected through their common revision history.
I'm not sure if timestamps can be trusted to be unique revision identifiers. For CVS, there's no other possibility than using timestamps for retrieving all changes since the last update, but I don't know if this is a proper approach for revision control systems that support atomic commits. So I'd like to put the 'updated' column from the repositories table into the specific backends instead of the central database storage.

Results

Based on these assumptions, I tried to split up the database schema from cvs.install into different .install files, one for each of rcs, project_rcs and cvs. (Added to this posting as attachments.) I consider those a minimal starting point: if I got it approximately right, then it shouldn't be necessary to take more things out. Changelog:

{rcs_commits} is an unmodified copy of {cvs_commits}, and goes into rcs.install. As CVS doesn't allow whole commits to be identified with a revision identifier, those are expected to be present only in the file revisions table.
A slightly shortened version of {cvs_repositories} exists as {rcs_repositories}, the few items that have been left out are subject to the respective RCS backends. ('modules' is arguably a CVS-only setting, while 'updated' makes for the best possible global revision indicator for CVS, but not for revision control systems supporting atomic commits. I'm not sure if 'method' wouldn't be popular enough to warrant inclusion despite being an RCS-specific setting.) Apart from that, {rcs_repositories} contains a new column called 'rcs', which specifies the backend that should be queried for more information.
{project_rcs_projects} and {project_rcs_maintainers} have been carried over to project_rcs.install unchanged from their predecessors {cvs_projects} and {cvs_project_maintainers}. They're specific to project.module and therefore don't really belong in the general RCS module.
{cvs_accounts}, like supposedly all authentication mechanisms, stays in cvs.install.
{cvs_repositories} acts as extension for {rcs_repositories}, containing data that is CVS specific.
{cvs_files} and {cvs_tags} have been left out for now, as I'm still unsure about how to handle files and tags/branches. At the moment, the plan for {xxx_files} is that this is handled by the specific backend, but I guess it's a good idea to flesh out the API itself before deciding this. As for branches and tags, I have yet to come up with a plan, but I think this should be centralized.
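To illustrate how {cvs_repositories} could act as an extension of {rcs_repositories}, here is a minimal sketch that uses plain PHP arrays in place of database rows. Only the 'rcs', 'modules' and 'updated' columns come from the description above; 'repo_id', 'name', the sample values and the function name are assumptions.

```php
<?php
// Rows as they might come from the central {rcs_repositories} table.
// 'repo_id' and 'name' are assumed column names; 'rcs' names the backend.
$rcs_repositories = array(
  1 => array('repo_id' => 1, 'name' => 'Contributions', 'rcs' => 'cvs'),
  2 => array('repo_id' => 2, 'name' => 'Sandbox', 'rcs' => 'svn'),
);

// Backend-specific extension rows, keyed by backend and then by repository id.
// Only the CVS backend stores extra data in this example.
$extension_rows = array(
  'cvs' => array(
    1 => array('modules' => 'contributions', 'updated' => 1180463160),
  ),
);

// Hypothetical loader: merges the central row with the backend's extension row.
function rcs_example_load_repository($repo_id, $base_rows, $extension_rows) {
  if (!isset($base_rows[$repo_id])) {
    return NULL;
  }
  $repository = $base_rows[$repo_id];
  $backend = $repository['rcs'];
  if (isset($extension_rows[$backend][$repo_id])) {
    // Keep backend data in its own sub-array so column names cannot clash.
    $repository[$backend . '_specific'] = $extension_rows[$backend][$repo_id];
  }
  return $repository;
}

print_r(rcs_example_load_repository(1, $rcs_repositories, $extension_rows));
```

In a real module the merge would presumably happen in a load hook implemented by the backend rather than in one central function.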

Questions

Naturally, many issues are not yet resolved. Here are the ones that came to my mind:

Just to make sure: Am I right that we always want to access revision info without querying the RCS itself?
Need a good solution on how to handle branches and tags. {cvs_tags} is not the ideal solution, given that branch != tag in most RCS. Neither can be attached to the commit itself, because it can potentially span multiple branches, including HEAD/trunk/whatever. (Which is not clean, but possible.)
Should {xxx_files} be centralized? If so, how do we represent renames/copies, and especially deletions (in a way that doesn't greatly cut database performance by requiring loads of joins)?
Is any extra care needed for directories? I think we can just treat them as files, as is standard on UNIX, but it's certainly possible that I'm missing something.
Need to research how git works server-side. I haven't yet looked into this .git file stuff (which is the checkout location for git repositories), so it's possible that some assumptions about the repository root location, or at least the column name, need to be thought over.

Upcoming work

So, that's my current status. What's next is to find answers to the questions above, install a few revision control systems on my computer (as a server), get to know how all of them work, and then come up with a first API draft. So much for the first one of those weekly reports.

Attachment	Size
rcs.install.txt	1.27 KB
project_rcs.install.txt	938 bytes
cvs.install.txt	974 bytes

Comments

Oh, I forgot

Posted by jpetso on May 29, 2007 at 6:57pm

As for how the work is divided between Revision Control and the backends, I had imagined something like this:

Backend modules register themselves (and their capabilities, if necessary) by implementing hook_rcs_backend():

<?php
function cvs_rcs_backend() {
  return array(
    'cvs' => array(
      'name' => 'CVS',
      'description' => t('CVS (Concurrent Versions System) is a code management system used by developers to collaborate and track modifications of code.'),
      // [...extensible for other info...]
    ),
  );
}
?>
The Revision Control module manages repository settings, providing one "Add [RCS] repository" for each registered RCS. Backends needing more repository settings can add more of them using hook_form_alter() on the respective form. Similarly, commit listings will be a barebone output, to be extended by appropriate hooks. Hooks for loading and saving more data into commits and repository objects will be provided.
Account settings and other authentication is completely left to the backends. They should also provide the administration forms for that.
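How the Revision Control module might collect the registered backends can be sketched as follows. This is only an illustration that runs outside Drupal: t() is left out, the collecting function is a simplified stand-in for what module_invoke_all() does, and the 'svn' backend and the function name rcs_example_get_backends() are made up.

```php
<?php
// Two backend modules implementing the proposed hook_rcs_backend().
function cvs_rcs_backend() {
  return array(
    'cvs' => array('name' => 'CVS', 'description' => 'Concurrent Versions System'),
  );
}

// Hypothetical second backend, for illustration only.
function svn_rcs_backend() {
  return array(
    'svn' => array('name' => 'Subversion', 'description' => 'Subversion'),
  );
}

// Simplified stand-in for module_invoke_all('rcs_backend'):
// calls MODULE_rcs_backend() for every module that implements it.
function rcs_example_get_backends($modules) {
  $backends = array();
  foreach ($modules as $module) {
    $function = $module . '_rcs_backend';
    if (function_exists($function)) {
      $backends = array_merge($backends, $function());
    }
  }
  return $backends;
}

// 'node' does not implement the hook and is simply skipped.
$backends = rcs_example_get_backends(array('cvs', 'svn', 'node'));
foreach ($backends as $info) {
  echo 'Add ' . $info['name'] . " repository\n";
}
```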

Tell me in case I'm wrong on anything I mentioned.

More on the modules' scopes

Posted by jpetso on May 30, 2007 at 3:19pm

Talking with AjK on IRC, we came to the conclusion that it might be best not to include the commit listings in the central "Revision Control" module, which makes it possible to call it "Revision Control API" (short name: still "rcs"), and that's really what it tries to be. So the commit listings will land in an own small module, say, "Revision Log", which depends on the Revision Control API. Doesn't touch the database representation, of course. The rest of the module division looks ok to him, so I might be creating the corresponding projects soon.

Also, it was decided that it might be a good idea to leave the possibility open for backend modules to query the RCS directly. For example, in a way that the Subversion backend doesn't need to store its {svn_files} table (although it still might, that's an implementation detail) and that it could query svnlook directly instead.

The latter decision pretty much makes the {xxx_files} question obsolete, as the table might now not exist at all in specific backends. Note that performance reasons (which I know little of) could make it necessary to keep the files table nevertheless, only we won't rely on it being there. So there's only the tags/branches handling left as the major discussion point for the database schema. Yay!

Correction

Posted by jpetso on May 30, 2007 at 3:23pm

So the commit listings will land in an own small module, say, "Revision Log",
which depends on the Revision Control API. Doesn't touch the database
representation, of course.

Not entirely true: the "diff", "file view" and "issue tracker" urls will probably vanish from {rcs_repositories}, and move to {rcslog_repositories} instead.

a few ideas

Posted by dww on June 5, 2007 at 4:59pm

1) it's not immediately clear to me that we want to keep the "Revision log" viewer separate from the rcs_api module itself. what's the motivation for that? are we imagining a case where someone wants rcs_api and not the commit viewer? seems like a strange case to optimize for. sure, we can put the code in its own .inc file to only load it when we need it, but a whole separate module seems like overkill. setting up project* and some revision control is complicated enough as it is. while modularity and abstraction is a nice goal, we need to balance that with usability...

2) i'd need to think about it more, but i'm not 100% thrilled with all of the account/authentication stuff being pushed all the way down to the backends. a bunch of the code in cvs.module is for either a) the per-project RCS access control stuff or b) the "CVS account" form and related operations. i'm assuming that while the details of the authentication for each repository might be different, that most of the code could in fact be shared and managed at the rcs_api level, not in the backend. e.g. instead of a "CVS account" form, we'd have a "Revision control account" form, and there'd be a series of checkboxes for each repository on the system that's configured to allow public accounts, and each repo could be from a different RCS. so, you just have a single form, a single motivation, a single account status, etc, but a given account might have access to N different repositories, with potentially N different passwords, and usernames. perhaps we can/should assume you have the same username on all repos, but that might be an unnecessary restriction. but, point being: i'd like to keep as much of this stuff in the rcs_api layer, not in the backends, to make the backends as small and easy to implement and maintain as possible.

3) overall, i'm happy with the organization of the separate modules, though i wonder if project_rcs couldn't just be a project_rcs.inc inside the main project.module? if it's really specific to project.module, it probably shouldn't be a separate module at all. if it's a separate module since in theory you'd want to apply the same code to the projects from casetracker or something, then project_rcs probably isn't a good name for it (and, if it's share-able, it probably just belongs back in rcs_api itself).

this is really exciting stuff. i can't wait to see it develop.

re: a few ideas (1)

Posted by jpetso on June 6, 2007 at 6:57am

1) That was the outcome of talking to AjK on IRC. I could explain it in detail here, but I think the (distilled) logs are more insightful:

<jpetso> what's your opinion on the proposed module division and naming?
<ajk^> that looked very reasonable. There will have to be a split somewhere and that seems like a sensible place. As for naming, trickie dropping the "API" from that module as that's really what's trying to be
<jpetso> yes, but i can't name a module "API" if it comes with administration and listings pages
<jpetso> that wouldn't be right
<jpetso> one thing that i could do is to have another split of the Revision Control, into an "API" and a "UI" module
<ajk^> well, an API could contain configuration elements. But listing pages? that takes it away from being an API true.
<ajk^> splitting out the UI is not a bad idea actually, the Views module does this to cut down overhead after config is done
<jpetso> on the other hand, Views still provides the views themselves even after disabling the UI module
<jpetso> and while we may get rid of the admin form, we still need the commit listings
<ajk^> true, but thise Views are "static" in that once configured they just repeat the same rules
<ajk^> well, if you look at the "tree" now you have project module, cvslog module sat at the same level. So a three way split isn't that big a departure from what's there now
<ajk^> ur just replacing cvslog.module with a new listings module
<jpetso> sounds sensible
<ajk^> rsclog.module ;)
<ajk^> for just extracting logged info from teh API
<jpetso> so, keeping the repository settings inside the API module, and putting the listings in its own, right?
<jpetso> yes, i kind of like it this way.
<ajk^> that sounds like a good route to me and will probably make ur life a bit easier breaking it up like that
<jpetso> then that's fixed.
<jpetso> cool.
<ajk^> good stuf :)

that sounds like a good idea

Posted by shiveringweb on June 6, 2007 at 12:56pm

splitting module into UI and API can come in very handy in the future in case any other project/module wants to make use of the API and not the UI for one reason or another...

re: a few ideas (2+3)

Posted by jpetso on June 6, 2007 at 8:16am

2)
I had planned the per-project RCS access control stuff to be managed by project_rcs which would provide hooks that can be implemented by the backends. So that if project_rcs needs an additional account or more permissions for an existing one, it would call on the specific backend module. The motivation for this was that not all backends might be able to handle automatic accounts & permissions stuff, so we'd make it optional.

But yeah, I think we need a finer-grained capabilities system anyways, so it might just be a good idea to push account creation and account management hooks back down to the RCS API. Makes a whole lot of sense. (I imagine capabilities like "create repositories", "list files", "account management" and maybe "branch/tag restrictions". Maybe we shouldn't even assume "list commits" and make that an optional capability as well?)
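A capabilities check along those lines might look like the sketch below. The 'capabilities' key, the function name and the assignment of capabilities to the two backends are assumptions for illustration; nothing of this exists yet.

```php
<?php
// Backend info as it might be returned by hook_rcs_backend(), extended with
// a hypothetical 'capabilities' list. Which backend supports what is made up.
$backends = array(
  'cvs' => array(
    'name' => 'CVS',
    'capabilities' => array('list commits', 'list files', 'account management'),
  ),
  'svn' => array(
    'name' => 'Subversion',
    'capabilities' => array('list commits'),
  ),
);

// Hypothetical helper: tells whether a backend announced a given capability.
function rcs_example_backend_can($backends, $rcs, $capability) {
  return isset($backends[$rcs]['capabilities'])
    && in_array($capability, $backends[$rcs]['capabilities'], TRUE);
}

// Only offer account management where the backend supports it.
foreach (array_keys($backends) as $rcs) {
  if (rcs_example_backend_can($backends, $rcs, 'account management')) {
    echo $backends[$rcs]['name'] . " supports account management\n";
  }
}
```

With a check like this, the API can leave out forms and menu items for anything a backend hasn't implemented.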

Either way, I now think this is doable and should be done, at least for the API and the "Revision control accounts" form. Provided that not all of the backends need to provide it, at least not from the start. (Makes for a great motivation for backend authors: providing more capabilities step by step, not all at once.)

How the motivation form is done really depends on the use case. For example, at BerliOS (I guess SourceForge as well) you have to provide a motivation for each project you start, together with loads of other information, whereas on drupal.org you get one global account with a simple motivation form and after that you can create projects as needed. I think we should make this pluggable. Which is, of course, easier if the API itself provides convenient functions for managing accounts. I'll look into that later on.

if it's a separate module since in theory you'd want to apply the same code to the projects from casetracker or something, then project_rcs probably isn't a good name for it

Great idea! Up to this point, I had only thought about how to factor this out, but generalizing is certainly a much better idea. We could just provide an additional option in the content type settings and mark the content type as "project with RCS integration", which in turn would immediately enable Project, Case Tracker or any custom node type with account management :D

I like it much better this way than the project_rcs.inc. We could even keep the name, only that the "Project" part of it doesn't refer to the Project module, but to any kind of project.

(and, if it's share-able, it probably just belongs back in rcs_api itself).

I disagree. This has quite a lot of user interface involved, and requires Drupal to be used in a very specific kind of way which is probably not desired in all kinds of use cases. I'd like the API just to provide facilities, but not make use of them yet; that should be the responsibility of integration modules.

one feature to keep in mind

Posted by aclight on June 5, 2007 at 7:05pm

One feature/behavior to keep in mind while you're working on this is the fact that project_release switches from allowing the user to upload a file to create a release to not allowing this when the cvslog module is enabled. In my eyes, this is a less than ideal situation, because I don't think all project administrators would want that behavior. For example, with the site I'm building, I suspect that version control will be too complicated for most of my users, but I want to make it an option for advanced users. Therefore, I've hacked a line or two in project_release.module that only removes the ability to upload a file as a release if the project that release is being created under has a repository set for it. If no repository is set for a project, then the user creates the release by just uploading a file, the same as if cvslog weren't installed.

I haven't tested this thoroughly, but it seems to mostly work. However, I've noticed that the CVS tag popup box is still present on the project_release node when it is edited, and then it can't be re-submitted because CVS tag cannot be blank, yet there are no options because there is not a repository associated with the project.

For drupal.org, it makes sense to require all projects to use cvs. But, as the goal of this SoC project is to make the project* modules more flexible (presumably not just for d.o, but for other sites), I don't think that either requiring everyone or no one to use version control makes sense. Furthermore, if there will be the possibility to use various RCS systems with different projects, I think you have to go to per-project logic when determining whether project_releases are created by the manual upload of a file or by an RCS/packaging script.

I will try to remember to post my updated cvs.module once I finish removing the all-or-none link between having cvslog enabled and building releases from the cvs repository instead of manual upload.

let's move this discussion to the issue queue

Posted by dww on June 5, 2007 at 11:11pm

i'd like to clean up how this file uploads vs. RCS thing works, regardless of jpetso's efforts here. i agree that the current form_alter()'ing code in cvs.module is too ambitious, and should only remove the file upload if the project has a repository set. so, let's see an issue about this in the cvs.module's issue queue and we can continue the discussion there. thanks!
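the per-project rule described here could be sketched like this. it's only an illustration under assumptions: the 'repository_id' key, the sample projects and the function name are invented and are not the actual cvs.module data structures.

```php
<?php
// Hypothetical project records; 'repository_id' is an assumed key that is
// empty when no repository has been set for the project.
$projects = array(
  'advanced' => array('title' => 'Advanced project', 'repository_id' => 3),
  'simple' => array('title' => 'Simple project', 'repository_id' => 0),
);

// Releases are created by manual file upload only if the project has no
// repository set; otherwise the RCS/packaging script builds them.
function example_release_uses_manual_upload($project) {
  return empty($project['repository_id']);
}

foreach ($projects as $project) {
  $method = example_release_uses_manual_upload($project)
    ? 'manual upload'
    : 'built from repository';
  echo $project['title'] . ': ' . $method . "\n";
}
```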