Localization server project integration database model

gábor hojtsy's picture

I am in the process of implementing the localization server project integration database model. When moving all translations to a web service, we need to track several things to be able to share strings as widely as possible (as Drupal itself does), while keeping track of each usage for reference, for exporting, and for helping translators assess the impact of how they translate a string. So when a translator starts translating "track", he will see that it is used in the tracker module and the audio module (a trivial example), so he knows that what he enters has an impact on at least these modules. For this to work, we need a fine-grained database model, which I am implementing right now. Here is a graph of what it looks like:

Localization server project integration database plan

We have a list of projects. This should be retrieved from updates.drupal.org, but there is no interface for it yet, so I am using some stub code to emulate it until we come up with one. The releases can only be queried by core compatibility at this time, so we need to store a (for now serialized) list of core compatibility values for each project, which we use when we request the releases of a project for each compatibility value. In terms of performance and the number of HTTP requests, it would be better for us to skip the core compatibility part completely, as we always query all releases with all core compatibility values to record any new or modified release. For now, I use the core compatibility based interface though.
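To illustrate the per-project list of core compatibility values, here is a minimal sketch. It assumes JSON as the serialization format (the post does not say which one is used) and a per-core URL layout of `release-history/[project_name]/[core]`, which is an assumption of this sketch. The project name `example` and both function names are hypothetical.

```python
import json

# Host taken from the discussion; the per-core path layout is an assumption.
BASE = "http://updates.drupal.org/release-history"


def serialize_compat(values):
    """Store the core compatibility values of a project as one string."""
    return json.dumps(sorted(set(values)))


def history_urls(project, serialized):
    """Build one release history URL per stored core compatibility value."""
    return [f"{BASE}/{project}/{core}" for core in json.loads(serialized)]


stored = serialize_compat(["6.x", "5.x", "5.x"])
for url in history_urls("example", stored):
    print(url)
```

One request per compatibility value is exactly the overhead that a combined interface would remove.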

Once we have all the releases of projects, we can go through the releases, grab their tarballs, and parse each of them. These tarballs contain multiple files, which the extractor script already records just fine. Strings appear in these files, and a string can appear on multiple lines in a file. When a translator gets stuck on how to translate a particular string, we can point him to the exact files and lines where the string is used (at the translator's request), and we will also export the file and line information to translation files as is done now. So lines form the connection between files and strings.

This model allows sharing of strings between different projects, different releases, different files and even different lines. Unfortunately, getting a list of all strings from all releases of a project will be a monster join, but that is how life goes. We will mostly need a list of strings for a given release, which is somewhat less work for the database server.
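The relationships described above (projects have releases, releases have files, and lines connect files to strings) can be sketched with an in-memory SQLite database. All table and column names here are hypothetical, since the post only names the entities, and the sample rows are made up to mirror the "track" example.

```python
import sqlite3

# Hypothetical schema: the post names the entities, not the columns.
SCHEMA = """
CREATE TABLE projects (pid INTEGER PRIMARY KEY, uri TEXT, title TEXT);
CREATE TABLE releases (rid INTEGER PRIMARY KEY, pid INTEGER, title TEXT);
CREATE TABLE files    (fid INTEGER PRIMARY KEY, rid INTEGER, location TEXT);
CREATE TABLE strings  (sid INTEGER PRIMARY KEY, value TEXT UNIQUE);
CREATE TABLE lines    (fid INTEGER, sid INTEGER, lineno INTEGER);
"""


def build_demo():
    """Create the tables and load a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO projects VALUES (?, ?, ?)",
                     [(1, "tracker", "Tracker"), (2, "audio", "Audio")])
    conn.executemany("INSERT INTO releases VALUES (?, ?, ?)",
                     [(1, 1, "tracker 6.x-1.0"), (2, 2, "audio 6.x-1.0")])
    conn.executemany("INSERT INTO files VALUES (?, ?, ?)",
                     [(1, 1, "tracker.module"), (2, 2, "audio.module")])
    conn.executemany("INSERT INTO strings VALUES (?, ?)",
                     [(1, "track"), (2, "Recent posts")])
    # Each row is (file, string, line number); "track" occurs twice in one file.
    conn.executemany("INSERT INTO lines VALUES (?, ?, ?)",
                     [(1, 1, 10), (1, 1, 42), (1, 2, 15), (2, 1, 7)])
    return conn


def strings_for_release(conn, rid):
    """The common case: all distinct strings used in one release."""
    rows = conn.execute(
        """SELECT DISTINCT s.value FROM strings s
           JOIN lines l ON l.sid = s.sid
           JOIN files f ON f.fid = l.fid
           WHERE f.rid = ? ORDER BY s.value""", (rid,))
    return [row[0] for row in rows]


def projects_using(conn, value):
    """The impact view: which projects use a given string at all."""
    rows = conn.execute(
        """SELECT DISTINCT p.title FROM projects p
           JOIN releases r ON r.pid = p.pid
           JOIN files f ON f.rid = r.rid
           JOIN lines l ON l.fid = f.fid
           JOIN strings s ON s.sid = l.sid
           WHERE s.value = ? ORDER BY p.title""", (value,))
    return [row[0] for row in rows]


conn = build_demo()
print(strings_for_release(conn, 1))
print(projects_using(conn, "track"))
```

The per-release query needs two joins, while anything that starts from a project has to walk through releases, files and lines as well, which is the larger join mentioned above.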

By the way, this is just the "project integration" part of the database model. Strings, plural information, translation suggestions and translation team management all add their own tables, so the big picture will be more complicated. Anyway, as I have pointed out, we are working out the project integration issues right now, so I am focusing on this part.

Comments

While I like the general

gerhard killesreiter's picture

While I like the general idea, I think it would be even more useful to point translators to the URL on which the string will appear. I have no immediate idea on how to do that, though.

Yes

boris mann's picture

A kind of "sandbox" check out of Drupal, which allows you to see the string in context.

how would we do it?

gábor hojtsy's picture

How would we do it? How do we know on what path the string appears? When you see a t('action') somewhere in actions.inc (in Drupal 6 core), where does it appear? It is possibly reused multiple times on different interfaces, but we don't even know the 'original' or 'primary' location where it appears.

On a site set up for this (with all Drupal modules and themes?? - who would set up such a site with all the additional module requirements?) we can collect where strings appear, but then we only have a small subset of the places, as we cannot be sure that we have them all covered. Projects such as ecommerce or views module only show the majority of their strings if you have stuff set up...

There is probably no real

gerhard killesreiter's picture

There is probably no real way to do this, unfortunately. What we could try is to set up a Drupal site, automatically add some content, enable modules and then send a crawler to look for the string. We'd still miss a lot of them, I guess.

Data for all releases for all core versions

dww's picture

It should be trivial to make 2 small changes (both technically outside of project.module) to make it easy to get info on all releases for all versions of core:

1) change project-release-history.php to generate a [project_name]-all.xml file, with all the release history, not segregated by core compatibility.

2) change the project-release-serve-history.php script to support this via a path such as http://updates.drupal.org/release-history/[project_name]/all

Both scripts live in contributions/modules/project/release, and should be easy to modify for this.

Patches to the project issue queue would be most welcome. ;)

Cheers,
-Derek

p.s. That still leaves the separate question of a simple .xml list of all projects. This, too, should hopefully be a relatively simple change to these two scripts. However, please handle that via a separate issue with a separate patch, so it's easy to review and test these unrelated requests independently, and get them committed and deployed without any unnecessary delays or dependencies on each other. Thanks.

performance questions

gábor hojtsy's picture

I have been thinking about the performance implications, and how to do this quickly and with fewer HTTP requests. Basically, the localization server needs basic data about the projects and releases. We only need the project node 'uri' and 'title' attributes, and some release information (title, timestamp, download link). The other data shown in the graph is only for internal use and does not come from outside sources (we don't need core compatibility, as described earlier). So this is a really small set of information we need to know about all projects and releases.

There are two phases of synchronization that could be done here:

  • initial data sharing
  • incremental updates

The initial data sharing is big, while the incremental updates are smaller. With the currently discussed model, the two phases are merged, and we always request the same amount of data as the initial data sharing would require (all releases for a project). We can make this a bit more efficient with "caching", i.e. we remember the latest of all the release timestamps, and construct a request with an If-Modified-Since header, which the server could honor. The only "problem" there is that if most projects have .x-dev versions released (which always move as new tarballs become available), this date would not be of any use.
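A minimal sketch of the "caching" idea: take the latest known release timestamp and turn it into an If-Modified-Since header. This only builds the request object and does not send it. The function name is hypothetical, the project name `example` is a placeholder, the `/all` path is the one proposed in this discussion rather than an existing interface, and whether the server would honor the header is not confirmed here.

```python
from email.utils import formatdate
from urllib.request import Request


def conditional_request(url, release_timestamps):
    """Build a request that asks only for data newer than what we already have.

    release_timestamps: Unix timestamps of the releases already recorded.
    """
    req = Request(url)
    latest = max(release_timestamps, default=None)
    if latest is not None:
        # HTTP dates must be in GMT, e.g. "Fri, 02 Jan 1970 00:00:00 GMT".
        req.add_header("If-Modified-Since", formatdate(latest, usegmt=True))
    return req


# Placeholder project name; the "/all" path is only a proposal in this thread.
req = conditional_request(
    "http://updates.drupal.org/release-history/example/all", [0, 86400])
print(req.header_items())
```

With no recorded releases the header is left out, so the first request falls back to the full initial data sharing.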

Anyway, I am just brainstorming about possible problems with continually updating from updates.drupal.org, but with update module getting into Drupal 6, it should be a strong backend, and the single localization server instance running could possibly be just a drop in the ocean.

I will look into implementing the currently discussed modifications with issues submitted against project module.

drop in the ocean

dww's picture

This will totally be a drop in the ocean. ;) The last thing I'd spend any time worrying about is performance on the backend.

FYI, the "server" is now an ultra-thin php wrapper, not a full-blown drupal.org menu callback:
http://drupal.org/node/155281#comment-267382
http://drupal.org/cvs?commit=71890

I'd just fetch full data, however often you need it. The more expensive operation will be fetching tarballs or checking out stuff from CVS. However, in this case, the .xml release history files can help since they include md5 hashes of the tarballs, so you can always remember the md5hash of the last one you fetched/processed, and if the release history file says it's the same, no need to refetch. This will even work for the dev snapshots.
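The md5 check could look roughly like the sketch below. It assumes the release history has already been parsed into a mapping from release name to md5 hash; the parsing itself, the function names and the release names are hypothetical.

```python
import hashlib


def md5_of(data):
    """Hex md5 digest of a tarball's bytes."""
    return hashlib.md5(data).hexdigest()


def releases_to_fetch(history, seen):
    """Return the releases whose tarball changed since we last processed it.

    history: release name -> md5 hash listed in the release history file
    seen:    release name -> md5 hash of the tarball we last processed
    """
    return sorted(name for name, digest in history.items()
                  if seen.get(name) != digest)


def mark_processed(seen, name, tarball_bytes):
    """Remember the hash of a tarball after fetching and parsing it."""
    seen[name] = md5_of(tarball_bytes)


seen = {}
mark_processed(seen, "example 6.x-1.x-dev", b"old snapshot")
history = {
    "example 6.x-1.x-dev": md5_of(b"new snapshot"),  # dev snapshot moved
    "example 6.x-1.0": md5_of(b"stable"),            # never fetched before
}
print(releases_to_fetch(history, seen))
```

Because the comparison is on content hashes rather than dates, a dev snapshot is refetched only when its tarball actually changed.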

Cheers,
-Derek

first issue submitted

gábor hojtsy's picture

I started with the project list interface, issue with patch submitted: http://drupal.org/node/157514

SoC 2007
