The big content translation models debate (with video!)

Gábor Hojtsy's picture

Although it was not planned, the first Drupal 8 Multilingual IRC meeting drove us into discussing content translation models for Drupal 8 in detail (see the meeting notes). To continue that discussion, I announced the second meeting with a specific focus on the topic, and after a short update on the current status we did return to it. Unfortunately, many people who attended the first meeting were not there this time. More importantly, a lot of people who think we are working on something they will not need to think about are deeply mistaken, and are better drawn in sooner rather than later.

Because this is easily one of the major pieces of the Drupal 8 Multilingual Initiative, I decided to take more time and produce a deeper explanation of the context. It ended up being a (mini) presentation, which I recorded as a screencast for discussion so people can watch it anytime from anywhere. I think this screencast is useful if you want to understand a bit about translatable fields in Drupal 7, the core translation module, and their respective histories and fallacies. But the ultimate goal is to get you thinking about (and then contributing to) solving the major tasks outlined. If you have better ideas, those are equally valued. The planned concepts explained are the result of many discussions, in part in the second IRC meeting (the log of which is attached).

Drupal 7/8 Content Translation models from Gábor Hojtsy on Vimeo.

We need your feedback and we need your involvement!

Attachment: D8MI-meeting2-log.txt (28.14 KB)

Comments

field_entity translation

chriscalip's picture

field_entity translation rocks !! wohooo

Harder than it sounds

Crell's picture

"On paper it looks simple." - That's the key problem. :-) "Make this stuff translatable" is easy to say, but very hard to do. The current Field translatability is a nasty kludge. The API for writing a field or formatter or widget is almost incomprehensible as a result. If we make more things translatable, we need to do so in a way that doesn't make working with it as hard as pulling teeth.

I think the first step would be a generic language-sensitive entity relation mechanism. Let Noderef, Userref, Taxonomy ref, etc. share a common base (tip: We need OO here) that is translation sensitive. Just that would get us a lot further along, then we can see what if any other systems we need to add.

I overall prefer the unified object model as a base, as it is easier to work with for most of the use cases I've run into. It also seems like it's easier to "support by default" for other contrib modules than the linked-nodes model. Making it easy for other modules to "just kinda work" out of the box without lots of extra effort is important, otherwise all the work we go to will be for naught.

The current Field

yched's picture

"The current Field translatability is a nasty kludge. The API for writing a field or formatter or widget is almost incomprehensible as a result."

The code handling field translatability is indeed quite mind bending - but I'm not sure I see how that affects writing formatters and widgets, which simply receive values and deal with them, mostly not giving a damn about the language?

Other than that, I think one way to simplify the handling of translatable (or not) fields and translatable (or not) properties would be to turn $entity->field_foo and $entity->property_bar into classed objects, encapsulating access to values through a unified interface:

$entity->[field|property]->get($langcode = NULL) takes care of proper language resolution and fallback depending on whether the field/property is translatable, translated, etc. We don't want the painful $entity->field_foo['und'][0] syntax to spill over to current properties. The Entity API module experiments with something like that, I think.
A unified interface for field and non-field values might also help blur the hard line for other features (widgets, formatters...).
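
To make the idea of a unified get($langcode) accessor concrete, here is a minimal sketch in Python rather than Drupal PHP. The class name EntityValue, its constructor arguments and the fallback order are assumptions made for illustration only; nothing here is an existing Drupal or Entity API interface.

```python
from typing import Dict, Generic, Optional, TypeVar

T = TypeVar("T")

LANGUAGE_NONE = "und"  # Drupal 7's code for language-neutral values


class EntityValue(Generic[T]):
    """Hypothetical wrapper around one field or property of an entity."""

    def __init__(self, values: Dict[str, T], translatable: bool,
                 default_langcode: str = LANGUAGE_NONE) -> None:
        self._values = values                  # langcode -> value
        self._translatable = translatable
        self._default_langcode = default_langcode

    def get(self, langcode: Optional[str] = None) -> Optional[T]:
        # Non-translatable values ignore the requested language entirely.
        if not self._translatable or langcode is None:
            return self._values.get(self._default_langcode)
        # Translatable and translated: return the translation.
        if langcode in self._values:
            return self._values[langcode]
        # Translatable but not translated: fall back to the default language.
        return self._values.get(self._default_langcode)


title = EntityValue({"en": "Hello", "hu": "Szia"}, translatable=True,
                    default_langcode="en")
status = EntityValue({LANGUAGE_NONE: 1}, translatable=False)

print(title.get("hu"))   # Szia
print(title.get("fr"))   # Hello (fallback)
print(status.get("hu"))  # 1 (language ignored)
```

The caller never touches the ['und'][0] structure; whether the value is a field or a property, translatable or not, is hidden behind the same call.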

Also, one impact of turning base properties like author and status into translatable values is that they can hardly be stored on the base table anymore (because the number of languages, and therefore of possible values, is not pre-determined). node.uid and node.status are currently used in many direct queries on the base table, so the performance impact might be non-minor.

Other than that, the idea of having some properties that can be made translatable sounds appealing, but finding a name to qualify them (as opposed to ->nid) might be fun - translatable-able ? :-)

Arrays

Crell's picture

Mostly it's because it makes the obscure and incomprehensible array structure of a field in a widget or formatter even more obscure and incomprehensible. I genuinely don't know how to write that nutty dereferencing code to get at the actual value without just copy-pasting it from somewhere else. It's just too obtuse.

I agree that making entities and fields into proper classed objects is likely the best solution. We may also want to see if the emerging PHPCR standard has a solution here we can leverage.

Entities as a proper class I

catch's picture

Entities as a proper class I think will happen for D8 - lots of people want it but it probably needs D8 to actually open up and start moving to get momentum on patches.

fwiw you should use field_get_items() rather than accessing objects directly anyway.

quick storage

Gábor Hojtsy's picture

Well, fago said in the attached IRC meeting that he'd propose we change the primary key on the node table from nid to (nid, language). Then you can store properties the same way and can use them in joins just as effectively. I'm sure this has wide ranging consequences but maybe not as much as moving properties out of the node table (which leaves the node table pretty empty then).
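
A rough sketch of what a (nid, language) primary key could look like, using SQLite from Python purely as an illustration. The column name langcode, the sample rows and the helper published() are assumptions; this is not the actual Drupal schema or a committed design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (uid INTEGER PRIMARY KEY, name TEXT NOT NULL);

-- One row per (nid, langcode): properties such as uid and status stay on
-- the base table but can now differ per language.
CREATE TABLE node (
    nid      INTEGER NOT NULL,
    langcode TEXT    NOT NULL,
    uid      INTEGER NOT NULL,
    status   INTEGER NOT NULL,
    title    TEXT    NOT NULL,
    PRIMARY KEY (nid, langcode)
);

INSERT INTO users VALUES (1, 'author'), (2, 'translator');
INSERT INTO node VALUES (10, 'en', 1, 1, 'Hello');
INSERT INTO node VALUES (10, 'hu', 2, 0, 'Szia');
""")


def published(langcode):
    """Published nodes in one language, joined to their per-language author."""
    return conn.execute(
        "SELECT n.nid, n.title, u.name "
        "FROM node n JOIN users u ON u.uid = n.uid "
        "WHERE n.status = 1 AND n.langcode = ? "
        "ORDER BY n.nid",
        (langcode,),
    ).fetchall()


print(published("en"))  # [(10, 'Hello', 'author')]
print(published("hu"))  # [] (the Hungarian version is unpublished)
```

Joins on uid and status work as before, but every query now has to say which language row it wants, which is one of the wide-ranging consequences mentioned above.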

Language fallback

tsvenson's picture

Using the same nid for all language versions, and then adding the language code to make them unique, sounds like something that will make it much easier to manage multilanguage sites. Know the nid, then find out which languages the content is available in.

Should also make it easier to do language fallback...

Is it going to be using the two-character (en) or five-character (en_gb, en_us) codes for this?

--
/thomas
T: @tsvenson | S: tsvenson.com

I'm wary about breaking the

Crell's picture

I'm wary about breaking the very consistent "single opaque surrogate key" pattern just for language. The consequences of that are obscenely far reaching, and I'm quite sure that once we introduce a "variant" of a node for language we will find all sorts of other variants for other entities; Eaton and I actually at one point kicked around a proposal in Barcelona for what became Entities to have flexible "variants", where for instance a file would have a video variant, audio-only variant, transcript variant, etc. The use cases we came up with were legitimate. Our attempts to figure out a non-horrid way to store and reference such use cases decidedly less so. :-)

this already happened

Gábor Hojtsy's picture

This already happened, right? Language-dependent storage is part of the field structure in Drupal 7. Other variants for fields are not universally available. There is no full appreciation of this on the entity level though, which makes it obscure (e.g. many people wonder how to specify the field language, since it can depend very heavily on context and there is no standard indication). Regardless of whether we admit this in the main API/storage, we already have/need ways to identify the language-based sub-entities of the entities. That is at least needed for saving and editing at the moment, but obviously is also needed for relations. Anybody who cares about the language versions will need to deal with an entity as its ID plus language variant code, so only those who don't want to care about that could forget that there are language variant sub-entities there.

Are there any other variants supported that I'm not seeing across fields? Is it better to keep this variant support obscure and hidden for the sake of first look simplicity?

A little confused about image fields

tsvenson's picture

Very interesting video, lots of great explanations about where translation options are heading for Drupal.

One thing though that the video doesn't really explain in relation to the translatable fields is the text content often accompanying an image.

While the image itself might not need a translation and is preferably reused on all the translated versions of the page, the ALT and TITLE options will still need translations. Will it be possible to configure things so that those, and other attached fields, can be translated while still having only one entry for the image field?


Current implementation of

fietserwin's picture

The current implementation of i18n_sync does what you are asking ([#1142178]), so I guess that behavior should be kept when specifying this in entity translation.

Looking at the details: an image field value has a fid, alt and title part. So translating an image field value means that all three parts are copied to the other language. Strictly speaking, the file reference is thus duplicated and can get out of sync. I guess it is not possible to store the fid under 'und' and the alt and title under specific languages ('en', 'fr', ...).
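
The duplication problem can be shown with a small Python sketch that mimics the language-keyed array of an image field. The sample values and the helper out_of_sync() are made up for illustration; only the fid/alt/title parts come from the description above.

```python
import copy

# An image field keyed by language, loosely mirroring $node->field_image.
image_field = {
    "en": [{"fid": 42, "alt": "A red bicycle", "title": "Bicycle"}],
}

# Translating copies all three parts, including the file reference.
image_field["fr"] = copy.deepcopy(image_field["en"])
image_field["fr"][0]["alt"] = "Un vélo rouge"
image_field["fr"][0]["title"] = "Vélo"


def out_of_sync(field):
    """True if the language versions no longer point at the same file."""
    fids = {items[0]["fid"] for items in field.values()}
    return len(fids) > 1


print(out_of_sync(image_field))  # False: both languages use fid 42

# Replacing the file on the English version only leaves French on the old one.
image_field["en"][0]["fid"] = 43
print(out_of_sync(image_field))  # True
```

Because the fid lives inside each language's copy, nothing keeps the copies aligned unless something like i18n_sync propagates the change.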

I would think it will get out

tsvenson's picture

I would think it will get out of sync very easily. There's no real easy way of replacing an existing file for a node; you need to remove the old one and upload a new one.

The next issue is when the image is embedded in the text area, most often cached using an image style. Thus the old version will most likely be used on all languages but the one you update.

The Media module will hopefully be able to go far in sorting this out, since it doesn't embed a hard link to the image style image, but a smart object (data) that is then used when creating the output of the content. So, most likely, using it will at some stage propagate the new version of the image onto all pages it is used on.

Still, a lot to think about when developing the editorial UIs to make sure this is rock solid with minimum of complexity for content authors/administrators.


Author and Translator management

tsvenson's picture

I like very much that status, author and other things will receive more flexibility as well as allow for creating more robust workflows and permissions.

However, I believe there are still a few things missing, or not fully explained in the video, when it comes to how to manage the author and translator across the various languages the content is available in.

The Author is the person who originally produced the content, which could be text, image, audio or video, and should be easily exposed on all languages it is available in. The Translator is the person, or machine in some cases, that takes the author's original work and translates it into a new language.

Especially on commercial sites it will often be needed, sometimes even required by agreement or law, that there is information showing who the original author is, in which language the content was originally produced, and who made the translation.

If this information was available, as well as metadata about which revision that was used for the last translation, then it would be possible to implement very powerful workflows, such as:

1 - The original content is updated enough to warrant a new revision.
2 - When saved and put in the right workflow stage, it triggers notifications to the translator for each language that they need to amend the translation.
3 - They can click a link that automatically uses diff to show the changes between the version they used for the existing translation and what has changed since then.
4 - They perform the amendments and the updated translation can be set in review workflow stage.
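
The four steps above can be sketched as follows, assuming each translation records the source revision it was made from. The class and method names are hypothetical and only illustrate the bookkeeping; they are not part of Drupal core or any contrib module.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Translation:
    langcode: str
    translator: str
    source_revision: int  # revision of the original this translation is based on


@dataclass
class Content:
    author: str
    source_langcode: str
    revision: int = 1
    translations: Dict[str, Translation] = field(default_factory=dict)

    def add_translation(self, langcode: str, translator: str) -> None:
        self.translations[langcode] = Translation(langcode, translator, self.revision)

    def outdated(self) -> List[Translation]:
        return [t for t in self.translations.values()
                if t.source_revision < self.revision]

    def save_new_revision(self) -> List[str]:
        """Steps 1-2: bump the revision, return the translators to notify."""
        self.revision += 1
        return sorted(t.translator for t in self.outdated())

    def revisions_to_diff(self, langcode: str) -> Tuple[int, int]:
        """Step 3: the pair of source revisions a diff would compare."""
        return (self.translations[langcode].source_revision, self.revision)

    def mark_updated(self, langcode: str) -> None:
        """Step 4: the translation now matches the current source revision."""
        self.translations[langcode].source_revision = self.revision


page = Content(author="alice", source_langcode="en")
page.add_translation("hu", "bela")
page.add_translation("fr", "chloe")

to_notify = page.save_new_revision()
print(to_notify)                                # ['bela', 'chloe']
print(page.revisions_to_diff("hu"))             # (1, 2)
page.mark_updated("hu")
print([t.langcode for t in page.outdated()])    # ['fr']
```

The original author stays on the content itself, while each translation carries its own translator and the revision it was translated from.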

I believe having possibilities like this is not only desired, but might very well be one of the most important missing pieces to give Drupal a killer argument for anyone who is looking for a Web CMS for a multilingual site.


partly in the two models

Gábor Hojtsy's picture

First of all, what will actually happen in Drupal 8 is still largely a question. We'll only be able to tell once it is released. Things are worked on and then either committed or not, and even if committed they can be pulled back out later (as happened with title as a field and the translatable fields UI in Drupal 7). So all available hands are appreciated to help! This is just a shiny idea at the moment and still needs to be verified, accepted and then worked on.

On your concrete workflow suggestions, those can only come once we have the basic problems solved of course. The object relation solution in core has these features in very simple ways already. You can mark the change on the original object as major and the translation nodes get marked outdated. The authors are separate so you can display them as "translators" if your requirement has that.

For the in-object translation solution, the contrib workaround in entity_translation to duplicate certain fields duplicates the author field to some degree, so the information there can be used for the same. Drupal core does not really have good workflow support, Ken Rickard will talk about this in London (http://london2011.drupal.org/coreconversation/1530-1600-d8-needs-better-...).

Site builder perspective

tsvenson's picture

Thanks for the info Gábor, that straightened out a few things for me. I'm just taking my first steps into actually coding for Drupal; my main focus is on the site building aspect.

I know I can be a bit over-explanatory sometimes, but I hope that the use cases I try to contribute are of some use to you, in the same way as the replies I get are most useful for me to better understand both the current and future possibilities of building sites with Drupal.


What were the cons of the relation model?

mazze's picture

My VERY FIRST impression is: the FIX solution might be a very difficult way to simulate all the advantages the relation model already has, so why not go for a more relation-minded model in the first place? Multilingual content is about complexity, and a relation model usually is a pretty good model to tackle this.

Yes, there were a few heavy shortcomings in the D6 model. Having redundant content might be the biggest one (and perhaps the reason why field translation was brought into D7 – I am just guessing).

Did you ever discuss the possibility of having a relational model, but with different levels/types of relations? Some fields would have their own fields in the translated node, others would refer to the source node (e.g. images).

I am really looking forward to the media module (also for connection to external DAM systems), and any of those use cases seem somehow more "relational-ish" to me.

It's a mind bender indeed – but just to make sure: do I forget any major issue with the relational model apart from unnecessary content copies?

translation sets as entities

catch's picture

"Did you ever discuss the possibility of having a relational model, but with different levels/types of relations? Some fields would have their own fields in the translated node, others would refer to the source node (e.g. images)."

I started work on this a couple of years ago in http://drupal.org/node/132075 but later abandoned it in favour of field translation.

The idea was to make translation sets fieldable (at that time entities as a term didn't exist in core, but it's the same thing). So you would have a 'translation set' entity (+ bundles), and the fields from this would be edited/loaded in the same place as the fields from the node.

This way, when adding a field to a node type, you could decide whether it would be applied to the individual node or the set.

Pros of that approach:

  • The decision for site admins is limited to individual fields more or less.
  • Data doesn't need to be duplicated.
  • The model could potentially be applied to other entity types consistently.
  • It would be possible to do a data migration path from both current models to this approach if it actually turned out to be good since it is kind of a hybrid.

Cons:

  • As well as a node for each translation, you have an additional entity independent of these that has its own field storage.
  • You have to merge fields into the entity form and rendering somehow - this is possible but can get ugly fast.
  • It would currently be impossible to move a field from the nodes to the translation sets or vice-versa (although migration from translation set to nodes in the set should be simple enough to implement compared to the other way 'round).
  • Doesn't work for anything except fields - unless modules make themselves entity agnostic (which may well be a possibility in Drupal 8 though with a proper entity API).
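
As a rough illustration of the fieldable translation set idea, the sketch below keeps shared fields on a set object and translatable fields on each node, then merges them at load time. All class, attribute and field names are hypothetical; the abandoned patch itself is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class TranslationSet:
    tsid: int
    # Fields attached to the set are shared by every translation in it.
    shared_fields: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Node:
    nid: int
    langcode: str
    translation_set: TranslationSet
    # Fields attached to the node itself differ per translation.
    own_fields: Dict[str, Any] = field(default_factory=dict)

    def load_fields(self) -> Dict[str, Any]:
        # A field is assumed to live in exactly one place (node or set),
        # so no collisions are expected when merging.
        merged = dict(self.translation_set.shared_fields)
        merged.update(self.own_fields)
        return merged


tset = TranslationSet(tsid=1, shared_fields={"field_image": 42})
en = Node(nid=10, langcode="en", translation_set=tset,
          own_fields={"body": "Hello"})
hu = Node(nid=11, langcode="hu", translation_set=tset,
          own_fields={"body": "Szia"})

tset.shared_fields["field_image"] = 43  # one update, seen by both nodes
print(en.load_fields())  # {'field_image': 43, 'body': 'Hello'}
print(hu.load_fields())  # {'field_image': 43, 'body': 'Szia'}
```

The merge step in load_fields() is where the "can get ugly fast" con shows up in practice: forms and rendering would need the same kind of merging.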

the race to one translation model

Gábor Hojtsy's picture

Right now the main problems are that we have two models, and that users need to choose between them while developers need to prepare for both. Someone who works on a module that needs to relate to entities, let's say signup or organic groups, would need to be prepared for the user having either a single entity, a translation set of entities, or an entity with possible language versions inside. That is three different models that the relation would need to support. So it's not only painful for users to choose the model (it needs lots of background info and is hard to cross-migrate), but also for developers to support them all.

Yes, we've been discussing attaching fields to translation sets instead of their entity members and sharing fields that way. I was a historic proponent of that model. That would complicate the field model even more, so that you'd have two models for attaching fields, this new case would then share the fields among other entities. It would complicate loading and updating the entities. And we'd still need to maintain the field translation model for the in-entity translation model.

Instead of introducing yet another new concept for field attachment when people struggle to understand field translation already, it seemed like a worthy goal to improve the developer experience of field translatability, and apply similar concepts universally to properties, which lets (a) modules that don't care about translation still work out of the box (b) modules that do care about translation only need to work with two (just an entity and translations under an entity) models instead of three. (In other words, if we keep the relation model and introduce field sharing there, we'll have two entirely different translation models and two field relation possibilities as well, complicating two crucial systems).

Theoretically we can achieve similar simplicity by dropping field translation altogether and implementing field sharing instead. Given the kind of momentum behind field translation though, and that we already worked on and know the strengths and weaknesses of it (with weaknesses to be fixed), dropping it altogether for an entirely different still imaginary system sounds like not the best idea at the moment. Also, our competitive analysis with systems like ezPublish showed that their own field translation system works pretty well (is nicely integrated and applied universally), which gives us more confidence that we should move forward on this route.

I worked a lot on the field

catch's picture

I worked a lot on the field translation issue as well. However I am extremely concerned about the proposal of having the entity primary key as nid + language. This means ending up with more than one record for the same entity in the entity table, which is not reducing complexity at all. It's taking a 'single entity' model, and then retro-fitting multiple 'entities' (under the auspice of different versions of the same entity) to it in a very roundabout way. There's not much discussion of how this might look yet, and I might be over-reacting, but that's my initial reaction to it.

A big reason to go for field translation was to prevent every module having to integrate with language or tnid - there are voting API, nodequeue, flag etc. issues where you can see what was needed to allow those to deal with tnids in addition to node IDs. In other words we wanted to make it so that contrib modules don't have to care about translation at all - they just implement their stuff as a field and get it for free.

The presentation is right that this mostly ignored what would happen to object properties. Mostly two assumptions were made: 1. title would be a field; 2. things like status/sticky etc. are hard-coded one-offs and we need a better API for these anyway. The problem of these becoming 1-many wasn't really discussed that I remember, though, and it does introduce problems.

Adding a compound primary key to entities is going to bring back the same issues as having both nid and tnid to join on did. Now this was only mentioned in the presentation with not really any indication of how it'd look, but it sounds like every query that joins on node now needs to specify language too, or support two different possibilities (a single entity record or multiple sub-entity records). If field translation requires doing this, then maybe it wasn't such a good idea after all.

The idea behind bringing up translation set fieldability again was this:

We make Drupal 8 so that modules integrate (at least from the point of view of relations and fields) via entity hooks or field types + fields. There would be no integration with /just/ nodes or /just/ users any more (that's reserved for tweaking).

So we would get rid of field translation, keep translation sets, then the choices for site builders are these:

  1. Translatable stuff goes on the 'main' entity (node etc.)

  2. Shared stuff goes on the translation set.

For references, everything integrates with entities. So it is not a case of modules integrating with translation sets, but instead when you set up a reference field, you allow either nodes or node translation sets to be referenced - and the same for anything else.

I do agree that exposing both nodes and node translation sets to admins could be tricky, but I dunno if that is going to be any worse than a 'sub-entity' concept to deal with.

Not saying this is a good idea, but wanted to bring it up as a 'third option'.

not everything is a field

Gábor Hojtsy's picture

Not everything is a field. An entity's signups, group members or votes are not field data. They are related data that is not part of the entity, but related to it. Now, if we implement votes, let's say, as possibly translatable fields, then the relation to the external data storage is what's "translatable", so you can have per-language votes or signups. So the relation then becomes per-language by virtue of being a reference in a field.

"In other words we wanted to make it so that contrib modules don't have to care about translation at all - they just implement their stuff as a field and get it for free."

Yeah, well, translation sets defaulted to "contrib modules need to care about translation" while translatable fields default to "contrib modules don't need to care about translation", but for those self-respecting modules that do want to support translation, I don't think we can avoid them working with language information directly. It cannot be just assumed that an outside system will handle it for them. If I'm collecting votes, and the user chooses whether it becomes per-entity or per-language under the entity, my module needs to be aware of the possibility of the difference, right? How would you make it transparent? Even if we implement all entity relations as fields on the entity (menu item, author, etc. become fields) and get translatability for free that way, we'd need to handle that in modules that use that data. They should be aware it can be different per language, be able to cache that data per language, etc.

Whichever model we choose, modules need to be aware of language support if they want to support languages. The field model was attractive because it makes stuff work by default for careless modules; however, that does not make it work without effort for modules that do want to care. In this sense, for modules that do want to care, the identifier for what they work with becomes, de facto, the entity with the language in question. We can avoid exposing this in the API (such as passing language by the side in a context instead of directly in the API), but that is the conceptual approach, anyway, no?

Entity ID + language

plach's picture

TBH I am concerned about switching to an Entity ID + language key too. During the IRC meeting we talked about making properties translatable, but I am still wondering about the definition of property:

[5:03pm] plach: GaborHojtsy: I'm trying to understand if the disticntion between properites and fields can be built exaclty upon this concept: a property is something so internal that does not need to be translated
[5:04pm] plach: nor edited, nor displayed, ecc.
[5:04pm] GaborHojtsy: plach: would work for me definitely
[5:04pm] plach: title, author and status do get edited/displayed/ecc.

When I thought about fields with optimized storage, I thought about replicating what ET/Title currently (more or less) do: we have values in the source language, which can be picked from the main entity table (label, language, author, status), and we have translated values that we pick by joining on a {translations} table. This way the main entity table would act as a form of materialized view, caching the values that are likely to be needed more often (in the case of a monolingual entity the {translations} table would not be needed at all). This would mean special-casing some (key) properties (fields!) unless we introduce some kind of MV support in core.
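
A minimal sketch of that storage idea, using SQLite from Python for illustration: source-language values stay on the main entity table and translated values come from a joined translations table, falling back to the source value when no translation exists. The table layout, column names and the helper load() are assumptions, not an actual schema from ET, Title or core.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Main table: holds the source-language values and is enough on its own
-- for monolingual entities.
CREATE TABLE entity (
    id       INTEGER PRIMARY KEY,
    langcode TEXT    NOT NULL,
    label    TEXT    NOT NULL,
    status   INTEGER NOT NULL
);

-- Translated values only; NULL means "not translated, use the source value".
CREATE TABLE translations (
    entity_id INTEGER NOT NULL,
    langcode  TEXT    NOT NULL,
    label     TEXT,
    status    INTEGER,
    PRIMARY KEY (entity_id, langcode)
);

INSERT INTO entity VALUES (1, 'en', 'Hello', 1);
INSERT INTO entity VALUES (2, 'en', 'Only English', 1);
INSERT INTO translations VALUES (1, 'hu', 'Szia', 0);
""")


def load(entity_id, langcode):
    """Return (id, label, status) for one language, falling back to the source."""
    return conn.execute(
        "SELECT e.id, COALESCE(t.label, e.label), COALESCE(t.status, e.status) "
        "FROM entity e "
        "LEFT JOIN translations t "
        "  ON t.entity_id = e.id AND t.langcode = ? "
        "WHERE e.id = ?",
        (langcode, entity_id),
    ).fetchone()


print(load(1, "hu"))  # (1, 'Szia', 0)
print(load(1, "en"))  # (1, 'Hello', 1)
print(load(2, "hu"))  # (2, 'Only English', 1)
```

Queries that only care about the source language never touch the translations table, which is the caching benefit described above; the cost is that label, status and similar properties become special cases.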

how to encapsulate this?

Gábor Hojtsy's picture

How would you hide this under an API? The status and author fields are used extensively for workflow and permission information for example. If a site wants to make these translatable, the workflow and permissions should follow. Now all permission modules would need to handle the translations table data then? Or how would this be hidden under an API for easy consumption? I think the idea is not to make translation a special case, because that highly decreases the likeliness of it being supported, but rather something baked in the API. Fields do this with the language key, that puts it pretty much in your face, hard to ignore. How would you abstract this for properties like status and author (I'm using property for stuff on a node that is not a field as per the video), so that a permission module would not need to handle the translation table if they want to support multilingual sites?

Edit: Also, how would a translation table handle entities with different (arbitrary) properties that need translation support and how would that interface seamlessly with the rest of Drupal's system (load/save/display/permissions/workflow, etc)?

Field translation should not

catch's picture

Field translation should not be in your face though. If you use field_get_items(), that handles all the language stuff; no-one should ever be writing $node->$foo[$langcode][0]['value'] (lots do, but this is a documentation failure and exposes limitations in the API; it's not by design).

If we have entities as classes, then $node->field could return the same thing as field_get_items() more or less - without exposing the underlying structure unless you print_r() and look at $entity->storage (which could be a protected or even private property if we went this far).

"Also, how would a translation table handle entities with different (arbitrary) properties that need translation support and how would that interface seamlessly with the rest of Drupal's system"

That points towards materialized views again, and layers like EntityFieldQuery that are storage agnostic. These things come at a cost (abstraction, complexity) but it might be possible to make trade-offs against abstraction and complexity elsewhere.

URL aliases

guy_schneerson's picture

One other issue is URL aliases. While URL aliases can be language specific as part of core, the integration with content/entity translation has an issue:

  • Once a node's language is changed from language neutral to a specific language, the fallback doesn’t work for URLs as it does for other fields, so you get a node/x path for all non-translated languages.

Two more issues exist with contrib module integration:

entity_translation
  • The translate interface doesn’t offer the original URL as a default like it does for other fields.

Pathauto
  • No integration with entity translations; it only works with node translations.

While we've been discussing

Jose Reyero's picture

While we've been discussing all this for years, I think it is the first time we can -kind of- see all of this working with some contrib modules. We have entity_translation and i18n now available.

I really think concepts should be tested in contrib as much as possible before we ever think of reworking the whole Drupal core data model. For instance, if we had ever built a multilingual-cck when fields were still in contrib we could be way better now.

Maybe we should start building our multilingual sites with one or the other or both modules and see the real life issues that arise with each of them.

the point of this thread

Gábor Hojtsy's picture

Yes, the point of this thread is to gather more of that kind of feedback from people who are deeply knowledgeable in the field/entity system, as well as from those who have used these tools to build sites with different use cases. Damien Tournoud kept saying in the first IRC meeting that he has a good supporting use case to share for going exclusively with entity_translation (you can look it up in the IRC transcript). I'm still looking for that use case and have been chasing him ever since. I'm also looking for any and all counter-points to make sure this could fit the use cases.

I've also set up an in-person use case review for people at the Montreal Drupalcamp (after the multilingual code sprint) to (in)validate this model: http://www.drupalcampmontreal.com/multilingual-problem-use-case-sharing. I'm also planning to set up a similar BoF in London but so far had little time to match my other appointments to make sure I put the BoF at a time/place where I can be available. Also, BoF scheduling for London just started a couple days ago. (http://london2011.drupal.org/conference/bofs). Some BoF times are filling up already, wow, maybe need to be faster placing that then :)

Your concrete feedback on the viability would be great!

update: London BoF submitted

effulgentsia's picture

See http://drupal.org/node/1498634#comment-6063794 and below. At the time I write this, compound primary key of nid,langcode in the {node} table (and similar for other entities) is the leading contender. A large D8MI code sprint is starting Monday in Barcelona, so please comment on that issue before then if you have feedback to add there (but please don't clutter that issue with comments not directly relevant to the storage question).
