Someone once said that inexperienced programmers worry about code, while advanced programmers worry about data structures. I don't remember who said it, but personally I find this an excellent observation.
Let's try to formulate the important questions that need to be answered in order to get the Revision Control API right:
1. What kinds of functionality do users need from the API?
2. What kinds of functionality do different revision control systems provide?
3. What does the data that is delivered to the caller look like?
4. Which functions are required to deliver the data to the caller?
Right, I have not yet written up the pieces about CVS and SVN, and I haven't yet examined Mercurial and Bazaar; all of that is planned and still to be done, so question 2 is not really finished yet. Nevertheless, I felt a pressing need to write down the data structures that I had imagined, so that there's something concrete to work with.
The big idea is to have a structured array called $commit, and populate it with data as needed. The very basic commit information is stored in the central {rcs_commits} table, and holds enough information for the backends to get any other data about the commit. Naturally, you'd be able to combine multiple commit arrays in a container array that goes by the name $commits, and that one would probably constitute the primary data structure that is passed around in API functions. As for additional data, I can think of three things we need from the backends:
- Commit info, like, "which files and directories were changed in this commit, and what happened to them". Everyone needs this; most prominently, revision logs like the one at drupal.org/cvs.
- A list of files associated with a revision or a point in time. Like, "as of the global revision X, these files exist in that given location, and they were last modified in per-file revision Y". This comes in handy for browsing repositories, like on cvs.drupal.org.
- And finally, custom information added by backends. Like, "this commit occurred inside the CVS module 'contributions', but the RCS API doesn't know about those so let's add this info in the CVS backend, in case someone wants to know".
The last point clearly does not need to be standardized; you could even question whether it's necessary at all. (Personally, I think it's a good idea.) The first two, however, require standardized data structures, and I tried to figure out what those could look like; the sketch below shows what I currently have in mind.
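To make this concrete, here's a rough, purely hypothetical sketch of such a $commit array in PHP. All key names and example values are invented for illustration and are likely to change as the API takes shape:

```php
<?php
// Hypothetical sketch only: key names and values are not final.
$commit = array(
  // Basic commit information, as stored in the central {rcs_commits} table.
  'commit_id' => 317,             // Drupal-side primary key.
  'repo_id'   => 2,               // The repository this commit belongs to.
  'date'      => 1180540800,      // Unix timestamp of the commit.
  'username'  => 'someuser',      // Per-VCS username of the committer.
  'revision'  => '1543',          // Global revision identifier; empty for CVS.
  'message'   => 'Fixed the foo in the bar.',

  // Standardized commit info from a backend hook: which files and
  // directories were changed by this commit, and what happened to them.
  'actions' => array(
    'trunk/includes/common.inc' => array(
      'action'   => 'modified',   // Or 'added', 'deleted', 'moved', ...
      'revision' => '1.42',       // Per-file revision, where applicable (CVS).
    ),
  ),

  // Custom, non-standardized information added by the backend.
  'fakercs' => array(
    'module' => 'contributions',  // e.g. the CVS module of this commit.
  ),
);

// Multiple commit arrays combine into the $commits container array.
$commits = array($commit /* , ... */);

// The second standardized structure, a file listing as of a given
// revision or point in time, might look like this:
$files = array(
  'trunk/includes/common.inc' => array(
    'type'     => 'file',         // Or 'directory'.
    'revision' => '1.42',         // Last per-file revision that touched it.
  ),
);
?>
```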
I wanted to put this somewhere visible and version controlled, so I created an example backend module named fakercs_backend and added it to the Revision Control API as backend developer documentation. Here's the current version of it - I expect it to change as the API becomes more concrete, and in the end it should cover all functions that backends can implement, together with inline documentation and return value examples. In the meantime, let it be the whipping boy for API construction purposes :-]
This current version reflects pretty well which data structures I had imagined, and if I didn't miss anything, it should be applicable to at least CVS, SVN and git, probably more. Function names and signatures, on the other hand, are not bad imho, but pretty much boilerplate, and have just been added to structure the other stuff in there. I wouldn't be surprised if they changed drastically. Still missing is a good plan on how to represent tags and branches, but I guess you're used to that by now... it seems to me that this will be the hardest part to figure out in the upcoming days or maybe weeks. But I'll get to it soon. (First, let me write up all those revision control systems, now that the data structures are written down.)
I guess you could label this post as another weekly report. Now that there is actual content to look through, I'd totally appreciate your input. Even more so if you know a lot about a specific RCS, because then you could confirm or deny that the respective backend could provide such info, and whether there are obstacles in doing so.

Comments
More data for repository browsing
Today, two other forms of data came to my mind, both of which are very useful when writing a repository browser:
- Contents of a given file. Like, "cat abc.txt".
- Annotations ("blame"): an array mapping each line of a file to the uid that last changed it.
I need some more time to think of a sensible solution for integrating those into the data structure, but we will have them.
Last one
That would be 6 types of commit data, and I think that's it.
We shouldn't re-implement the version control systems
After reading this, especially the follow-up comments, my primary concern is that we don't want this effort to end up re-implementing the version control systems themselves. This is all about integration, not duplication. Furthermore, I don't think we want to re-implement repository browsers for each system (or a generic one). Even after all this code is solid and committed and installed on d.o, I still see us running ViewCVS on cvs.drupal.org.
So, for example:
Contents of a given file. Like, "cat abc.txt".
That's an operation we could consider allowing backends to implement a hook for, but it doesn't belong in a discussion about the "data" we're storing. This is a rare enough occurrence that we should just ask each VCS to "cat abc.txt" for a given revision, tag, or branch identifier.
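As a rough sketch of what such a hook could look like (the name, the signature and the SVN shell-out are all invented for illustration; nothing like this is specified yet):

```php
<?php
/**
 * Hypothetical backend hook: retrieve the contents of a single file by
 * calling out to the VCS itself, instead of storing file data in Drupal.
 */
function fakercs_backend_file_contents($repository, $path, $revision) {
  // The equivalent of "cat abc.txt" at a given revision, tag or branch.
  // An SVN backend could simply shell out to "svn cat":
  $url = escapeshellarg($repository['root'] . '/' . $path);
  return shell_exec('svn cat -r ' . escapeshellarg($revision) . ' ' . $url);
}
?>
```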
Annotations
Even more so. It would become massively complicated and expensive to try to provide "blame annotated source" views for every possible version control system. :( That could be a whole SoC project in itself. Instead of your proposed array mapping every line of every file for each project on an entire site to the uid that last changed it, what the version control API needs is the ability to map a Drupal uid to any given version control uid. So, within the Drupal interface, there are easy ways to show:
All of these are very specific to the Drupal interface, and therefore, need tight integration via this API and its backend modules. Furthermore, none of these are possible with each VCS and its associated tools (repository viewers, etc) out of the box. To make these even more handy, they should provide auto-generated links to an external repository viewer for blame-annotated source, diffs, etc.
This last point is key. There's nothing Drupal-specific about the diff between two revisions, so it's not the job of a Drupal tool to provide that view. You could argue that annotations should be Drupal-specific, so that you can see the Drupal ids in the "to blame" column instead of the per-VCS uid, but I think we should all be willing to live with the per-VCS uid in this case, since the data costs and code complexity of trying to provide our own Drupal-specific annotation views would be huge.
See what I mean? Our guiding principle should be to keep this API as simple as possible, but no simpler. ;) We should try to avoid duplicating data between our DB tables and the canonical data in the VCS itself. Of course, we'll have to duplicate some, but we should minimize it. In large part, we should think of this duplicated data as a cache for performance reasons (so we don't have to keep calling out to the external tools all the time). Anything that makes sense to offload to existing tools that are specific to each VCS, we should (i.e. don't reimplement ViewCVS, WebDAV + SVN, etc). Anything that's highly specific to the Drupal UI and enables the kinds of functionality outlined above, we should figure out how to do as simply as possible, without the performance hit of calling out to external VCS tools all the time.
I think you got me wrong
All of the data I mentioned is supposed to be provided by backend hooks. Whether the data is stored in the database or not is completely up to the backend module; I don't want to prescribe anything here. When I say "which data do we need", I mean data that frontend modules are likely to make use of, not data that has to be stored in the database. The overwhelming majority of the data mentioned here can be retrieved on the fly from the RCS itself, with the exception of CVS not providing per-commit revision identifiers. For data that a backend cannot provide, there'll be capabilities functionality that frontend modules are supposed to check before using optional (for the backend) functions/hooks.
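To illustrate what I mean by capabilities (all names are invented, nothing of this is final): a frontend module would ask the backend what it supports before relying on an optional hook.

```php
<?php
// Hypothetical capability flag; the real list of capabilities is still open.
define('RCS_CAPABILITY_COMMIT_REVISIONS', 1);

/**
 * Hypothetical backend hook: report which optional features are supported.
 */
function fakercs_backend_capabilities() {
  // A CVS backend would leave out RCS_CAPABILITY_COMMIT_REVISIONS,
  // as CVS has no global per-commit revision identifiers.
  return array(RCS_CAPABILITY_COMMIT_REVISIONS);
}

// A frontend module checks the capability before using the optional hook.
$capabilities = module_invoke($backend, 'capabilities');
if (in_array(RCS_CAPABILITY_COMMIT_REVISIONS, $capabilities)) {
  // Safe to display global revision identifiers for these commits.
}
?>
```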
Also, I don't expect Drupal to switch away from ViewCVS, and I don't even plan on implementing a repository browser within this SoC. But I certainly want to make it possible with the API to provide a cross-VCS repository browser, and I know at least one company which would like to have such a thing. Of course, this would be an optional add-on module just like the project node integration has to be, and would be referenced with a view/diff/tracker/annotation URL just like all other repository browsers are referenced. And those URLs are probably a setting for the commit log module (which hooks into the API module's add/edit repository forms for this purpose), as the URLs are specific to how logs are presented in the commit listing.
See, I'm in full agreement here. Let the commit log module provide the links, and make it possible to either have ViewVC, WebSVN, Chora, a Drupal-module repo browser, or any other URL provide the view on this. You've got a point that retrieving Drupal uids could be very costly; I'll take those out of the hook and just leave the per-VCS uid in there. (One can always fetch them with an additional API function; see the sketch below.)
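A rough sketch of how such a mapping function could look (the table and function names are invented, nothing is decided yet):

```php
<?php
// Hypothetical mapping table between per-VCS usernames and Drupal uids:
//
//   CREATE TABLE {rcs_accounts} (
//     repo_id INT NOT NULL,
//     vcs_username VARCHAR(64) NOT NULL,
//     uid INT NOT NULL,
//     PRIMARY KEY (repo_id, vcs_username)
//   );

/**
 * Hypothetical API function: map a per-VCS username to a Drupal uid.
 */
function rcs_get_drupal_uid($repo_id, $vcs_username) {
  $result = db_query("SELECT uid FROM {rcs_accounts}
    WHERE repo_id = %d AND vcs_username = '%s'", $repo_id, $vcs_username);
  $account = db_fetch_object($result);
  return $account ? $account->uid : 0;
}
?>
```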
All of those are project specific and belong in the project node integration module (apart from "all the commits that a given Drupal user did, globally"). Given a general API function, and by processing its results in a more specific project node integration API function, we can get all of these. They're just not the "basic" functionality that anyone may need, but already very Drupalish, very Project-related functions, and I didn't cover those in this post. I will get to them yet.
Ok. As I really don't want to prescribe an {xxx_files} table for backends, I guess we'll end up with a {commit_id, nid} cache here. In the project node integration module, of course.
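For illustration, a hypothetical version of that cache (the table name and query are invented):

```php
<?php
// Hypothetical cache table, mapping commits to the project nodes they affect:
//
//   CREATE TABLE {project_rcs_commits} (
//     commit_id INT NOT NULL,
//     nid INT NOT NULL,
//     PRIMARY KEY (commit_id, nid)
//   );

// Fetching all commits for a given project node then becomes a simple join.
$result = db_query("SELECT c.* FROM {rcs_commits} c
  INNER JOIN {project_rcs_commits} p ON c.commit_id = p.commit_id
  WHERE p.nid = %d ORDER BY c.date DESC", $nid);
while ($commit = db_fetch_array($result)) {
  // Hand the commit over to the commit log display, for example.
}
?>
```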
One more thing that I agree with
Yes, definitely. That's undoubtedly essential for the API, and will get a function call.