Content language support in Drupal 8

Posted by plach on December 21, 2011 at 10:10am

One of the major goals of the D8 multilingual initiative is introducing consistent content language support in D8 core. Our definition of content here is whatever can be modeled through entities, which in D8 should become our official Data API. This is a (pretty long) writeup aiming to introduce all the aspects involved in this task. It is split into 5 sections treating increasingly lower-level aspects, so anyone can skip to the matter she is interested in. Every section summarizes the current status and points to the appropriate place to discuss the details:

Overview
User interface
API and developer experience
Data structures
(Default) Storage

Overview

While adding full content language support to Drupal 8 we will have to support two main scenarios:

Language assignment is the capability of determining which language an entity has and which language each field/property (component) attached to it has. In the simplest case we would have an entity whose language is the same as that of each of its components. In the most complex one we could end up dealing with entities whose components have languages largely differing from one another. For instance, we might have regular textual fields whose language matches the (parent) entity language, non-linguistic fields such as numbers which don't have a language assigned, citation fields which hold content in a language different from the main one, and fields whose value is imported from an external source for which the language information is simply not available.
This capability will allow us to mark our content up with semantic information about languages, which is an accessibility requirement (see http://drupal.org/node/1323338). Moreover having such information available is a key factor while dealing with multilingual content.
Multilingual content handling involves the capability of creating a variant of the entity components in a given language. Each variant has a language assigned and is what in Drupal 7 is called an entity translation, that is, a subentity to which the new values are tied. Usually these values are supposed to convey the same meaning as the original ones (translation), but from an implementation perspective we just need to ensure we are able to support different values for each available language.
In Drupal 7 fields have the translatable property, which has a somewhat misleading meaning: a translatable field can have a different value for each language, while a non-translatable field has a single value shared among all the entity translations. This approach does not allow us to fully support the language assignment functionality, since sharing works by assigning the field value the und language code, which indicates that the content language is unspecified. This is a suboptimal solution, since a "Latin citation" field might be shared among all the entity translations but still need to be assigned a language. OTOH a numeric (non-linguistic) field might hold a different value for each entity translation. This indicates that we need two distinct concepts in D8: an entity component can be linguistic or non-linguistic, and in the latter case it needs to be assigned the zxx language code (see also http://www.w3.org/International/questions/qa-no-language); at the same time a component can be shared among the entity translations or belong to just one of them. The former is an intrinsic property of the component, while the latter needs to be configurable, since its value depends on the use case being implemented. Since the main work here is on the UI side and this is not our main use case, the core behavior could still be to assign the und language to shared components, leaving everything else to contrib implementations. We would just need to ensure that language assignment at component level is not made impossible by our design.
The approach described above is only one of the possible models, although the one we are leaning towards at the moment: see the big content translation models debate (with video!) for a comprehensive introduction to all the available models.
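
To make the two independent concepts concrete, here is a minimal, language-neutral sketch in Python (not Drupal code). The Component class, its attribute names and the SHARED storage key are all hypothetical; only the und and zxx language codes come from the text above. The sketch assumes that a component's language is resolved in this order: zxx for non-linguistic components, then an explicitly assigned language, then und for shared components, and finally the language of the translation the value belongs to.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional

SHARED = "*"  # hypothetical storage key for a value shared by all translations


@dataclass
class Component:
    name: str
    linguistic: bool                          # intrinsic to the component
    shared: bool                              # configurable per use case
    assigned_language: Optional[str] = None   # explicit assignment, e.g. "la"
    values: Dict[str, Any] = field(default_factory=dict)

    def _key(self, translation_langcode: str) -> str:
        return SHARED if self.shared else translation_langcode

    def set(self, value: Any, translation_langcode: str) -> None:
        self.values[self._key(translation_langcode)] = value

    def get(self, translation_langcode: str) -> Any:
        return self.values.get(self._key(translation_langcode))

    def language(self, translation_langcode: str) -> str:
        """Language code to mark the value up with."""
        if not self.linguistic:
            return "zxx"                      # no linguistic content
        if self.assigned_language is not None:
            return self.assigned_language     # e.g. a Latin citation
        if self.shared:
            return "und"                      # shared, language unspecified
        return translation_langcode           # inherits the translation language


# Three of the cases discussed above.
body = Component("body", linguistic=True, shared=False)
citation = Component("citation", linguistic=True, shared=True, assigned_language="la")
price = Component("price", linguistic=False, shared=False)

citation.set("Carpe diem", "en")
price.set(10, "en")
price.set(12, "it")
```

With this model the citation is stored once but is still reported as Latin for every translation, while the price differs per translation and is always reported as zxx.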

Both functionalities are involved in content negotiation, i.e. the process of determining which content variants should be presented to users: entities with one or more languages assigned can appear in or be filtered out from entity listings depending on whether they match the defined language conditions. Moreover, multilingual entities allow us to implement language fallback, i.e. the process of determining which content variants to show (and whether to show them) when there are no values available for the requested language.
In the model being described here, language is an intrinsic property of an entity's content (semantic metadata). Surely language can be used to determine which content is best suited for a particular audience, but these are two distinct concepts which must be kept separate. For example, we might have content written in generic English (no other language subtag specified), which would be suitable for both American and British users. OTOH we might have scenarios in which some content is written in American English (en-US) but British users also need to have access to it, and ones in which content should be served only if its region subtag matches the user's locale. All of these use cases can be easily supported by having properly configured content negotiation and language fallback rules in place.
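
The three English scenarios above can be sketched as a small fallback routine. This is an illustrative Python sketch, not an existing Drupal API: the function names, the extra_fallbacks rule list and the strict flag are assumptions made for the example. It assumes fallback candidates are built by dropping subtags from the right of the requested language tag.

```python
from typing import Dict, Iterable, List, Optional


def fallback_candidates(langcode: str) -> List[str]:
    """Return the tag followed by its less specific forms, e.g. en-US, en."""
    parts = langcode.split("-")
    return ["-".join(parts[:i]) for i in range(len(parts), 0, -1)]


def negotiate(
    variants: Dict[str, str],
    requested: str,
    extra_fallbacks: Iterable[str] = (),
    strict: bool = False,
) -> Optional[str]:
    """Pick the langcode of the variant to show, or None to show nothing."""
    if strict:
        # Serve content only if the full tag (region included) matches.
        candidates = [requested]
    else:
        candidates = [*fallback_candidates(requested), *extra_fallbacks]
    for candidate in candidates:
        if candidate in variants:
            return candidate
    return None


generic = {"en": "Generic English text"}
american = {"en-US": "American English text"}
```

Generic English content is found for a British user through subtag fallback, American content reaches British users only when a site-specific rule such as extra_fallbacks=["en-US"] is configured, and strict=True covers the case where the region must match exactly.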

Place to discuss this: here.

User interface

To support the scenarios above we need several screens, many of which are already in place and just need to be revised, but also new UI components. Perhaps we will not be able to have every functionality in core, but we should at least design our UI so that we can implement the missing parts in contrib and have them integrate seamlessly with the core parts.

Language configuration: we need to come up with a UI allowing users to easily configure which languages should be available while creating content. We should keep in mind that not every language assignable to an entity component is necessarily one we actually wish to create a translation in (see the Latin citation example above). Place to discuss this: http://drupal.org/node/1314250.
Component language assignment: we need a way to explicitly assign a language to each entity component, since simply assigning to all linguistic fields the language of the subentity they are tied to (language inheritance) would not allow us to fully support the "language of parts" accessibility requirement. Moreover, we would not be able to meaningfully support the separation between the linguistic and shared component properties. There is some discussion about a possible UI in http://drupal.org/node/1165806 but we probably need a more focused (core) issue. There is an open issue about introducing a new language widget which might help a lot with this: http://drupal.org/node/1280996. However, since this is not really our main use case, we are probably going to support it only through a contrib solution.
Translation overview: we already have a screen providing an overview of the existing translations for a single entity and allowing users to create missing ones. No plan to revisit it at the moment, unless the UX team has some improvement to suggest.
Multilingual entity form: the Entity translation module, which is serving as a pilot project for the D8 content translation system, currently offers a separate form to enter the translated entity component values. In the last i18n sprint in Montreal we decided to drop the translation form in favor of a language-aware edit form. The main argument in favor of this choice was that users should not see a different form depending on whether they are editing the original field values or the translated ones. This is even more true for D8, in which the separation between original content and translations might be more blurred than in D7, at least from an architectural perspective. There is active work going on over there and feedback from UX experts would be welcome, so that we come up with a solution we will be able to port to D8 more or less as-is. Place to discuss this: http://drupal.org/node/1188388.
Global translation overview: at the moment core is totally missing a global overview of the content translation status, allowing users to tell what is translated, what is not, and how fresh the translations are. There are some solutions in contrib for this, which could serve as a good starting point. No open issue at the moment, just a somewhat related one: http://drupal.org/node/497804.
Translation workflow: we need UI screens allowing users to handle the content translation publishing status and mark translations as outdated. Moreover, translation administrators need to define per-language edit access, assign different translators to their translations and verify the translation progress. Having full support for the most common translation workflow (once we have clearly defined it :)) would be a really nice thing to have, but this is probably a good candidate to be actually implemented in contrib. However, designing the UI as part of the core process would help ensure we end up with a consistent user experience. There has been some discussion about this in http://drupal.org/node/1277776#comment-5169332 and following, but we need a dedicated issue.

Place to discuss this: issues cited above and http://groups.drupal.org/node/162159.

API and developer experience

One of the main goals of D8MI is (should be?) making i18n support ubiquitous and having it come for free whenever possible. This should be true both for core and contrib, as this is the simplest way to ensure that most of the D8 ecosystem is multilingual-enabled. This means making ML aspects totally transparent from an API consumer's perspective. We should try our best to design the entity API as if we had no ML support and limit its DX impact as much as possible.

We identified two main approaches to deal with ML entities in a language-unaware fashion: depending on the context it might be appropriate to act on data related to a single language or to all the available languages. The D7 Field API field_attach_* functions act on every available language, with the exception of field_attach_view and field_attach_form which instead act on the current language and the entity (original) language. From the experience gathered until now, we can tell that storage operations are good candidates for the all-languages workflow, while read/write operations at data structure level are mostly single-language based. More details on this in http://drupal.org/node/1260640.

In the single-language workflow we need a default language, since language-unaware coders should not need to specify one. Which language to default to depends on the current context, for example:

if the entity component data is going to be read/written in a view (read) context, then we need to default to the current content language;
if the entity component data is going to be read/written in a form submit (write) context, then the correct default should be the form language, i.e. the language of the entity variant being created/edited.

This suggests that we should have accessor methods that exploit the default language to return/set the appropriate values. They should probably also have an optional language parameter to allow coders to explicitly specify one. Which exact signature the accessors should have is a more general issue that should be dealt with while designing the D8 Entity API (http://drupal.org/node/1346204).
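
As a rough illustration of such accessors, the following Python sketch (not the D8 Entity API, whose signatures are still to be defined in the issue above) shows getters and setters that use a default language unless one is passed explicitly. The Entity class, its get/set methods and the default_langcode attribute are hypothetical names chosen for the example.

```python
from typing import Any, Dict, Optional


class Entity:
    def __init__(self, original_langcode: str) -> None:
        self.original_langcode = original_langcode
        # The calling context (view, form submit, ...) is expected to set this.
        self.default_langcode = original_langcode
        self._values: Dict[str, Dict[str, Any]] = {}  # {name: {langcode: value}}

    def get(self, name: str, langcode: Optional[str] = None) -> Any:
        langcode = langcode or self.default_langcode
        return self._values.get(name, {}).get(langcode)

    def set(self, name: str, value: Any, langcode: Optional[str] = None) -> None:
        langcode = langcode or self.default_langcode
        self._values.setdefault(name, {})[langcode] = value


node = Entity("en")
node.set("title", "Hello")      # language-unaware write, stored as "en"

node.default_langcode = "it"    # e.g. the form language of a translation form
node.set("title", "Ciao")       # same call, now stored as "it"
```

Language-unaware code only ever calls get("title") and set("title", ...); language-aware code can still pass a langcode, for example node.get("title", "en").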

A possible solution to always have a smart default in place might be exploiting the context stacking capability (see also http://drupal.org/node/1337114): this might allow us to have a stateful entity object without having to worry about explicitly handling its state ("dependency injection", "dependency injection", "dependency injection", spin, goat sacrifice). We could instantiate an entity and provide it with a context object whose content language contains the appropriate value depending on, well, the current context. Having a dynamic context-dependent default would be a key factor in allowing us to deal with language-aware entity tokens, for instance.

Also ML search should be taken into consideration: there are a couple of issues open in the Entity translation module queue (see http://drupal.org/node/1335394), perhaps we should revisit this once we have come up with a fully-working solution in contrib.

Place to discuss this: http://drupal.org/node/1277776.

Data structures

There is active discussion going on about the definition of fields and properties and their relations in http://drupal.org/node/1346214. This will have a relevant impact on the ability to translate properties. There are different approaches to storing ML properties: one possibility is mimicking the current Field API data structure and having per-language values stored under the same object field ($entity->property_name[$langcode]); another one is having a translation bucket under which all the property values are stored ($entity->translations[$langcode]->property_name). The key point is that in both scenarios all language variants would always be available.
The most concrete proposal at the moment is implementing properties as classed objects and having fields extend the base class, hence something similar to the first approach. However, this is really an implementation detail, since entities as classed objects should allow us to encapsulate the actual implementation. OTOH this will have an impact on how the property base class would be implemented: should it hold all the language variants, or should each variant be a distinct class instance? We have neither an answer nor an issue for this at the moment.
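
The two layouts discussed above can be compared side by side. The sketch below uses Python stand-ins for the PHP expressions in the text; the helper function names are made up for the example, and the conversion function only illustrates that both layouts carry the same information.

```python
from types import SimpleNamespace

# Layout 1, per-property values: $entity->property_name[$langcode]
entity_a = SimpleNamespace(title={"en": "Hello", "it": "Ciao"})

# Layout 2, translation bucket: $entity->translations[$langcode]->property_name
entity_b = SimpleNamespace(
    translations={
        "en": SimpleNamespace(title="Hello"),
        "it": SimpleNamespace(title="Ciao"),
    }
)


def get_per_property(entity, name, langcode):
    return getattr(entity, name)[langcode]


def get_from_bucket(entity, name, langcode):
    return getattr(entity.translations[langcode], name)


def per_property_to_bucket(entity, names):
    """Convert layout 1 to layout 2 for the given property names."""
    translations = {}
    for name in names:
        for langcode, value in getattr(entity, name).items():
            bucket = translations.setdefault(langcode, SimpleNamespace())
            setattr(bucket, name, value)
    return SimpleNamespace(translations=translations)


converted = per_property_to_bucket(entity_a, ["title"])
```

Since one layout can be mechanically derived from the other, hiding the choice behind entity accessor methods keeps it an implementation detail, as the text argues.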

Related to this matter is also the ability to translate entity labels. In D7 this is done by replacing entity label properties with a (text) field. In D8 we need to define more clearly what an entity label actually is: depending on the context it may be used as an administrative item, a navigational item or pure content possibly containing markup. Not every entity type might need the last two, while the first one is probably required. The default label implementation could be backed by a (translatable) property, but we might introduce the possibility to designate a field as a "label provider" instead of the plain property. The Entity interface should expose a method to get the entity label as a plain string, and it should be the responsibility of the label provider to implement a to-string functionality. We should retain the capability of having computed labels, which might not even need to be translated. For instance, an event date might just need to be localized on the fly, so a value shared among all the entity translations should be enough. This flexibility should free us from requiring an actual storage column for the label, which would be available only where it actually makes sense. In this scenario, the pure content behavior could be implemented on a per-entity type basis, if ever needed. Place to discuss this: http://drupal.org/node/1188394.
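
The label provider idea might look roughly like the following Python sketch. Everything here is hypothetical (the Entity class, the provider functions and the per-language date formats); it only illustrates a label() method that always returns a plain string, backed either by a translatable property or by a computed value that is shared among translations and localized on the fly.

```python
from datetime import date
from typing import Any, Callable, Dict, Optional


class Entity:
    def __init__(
        self,
        original_langcode: str,
        values: Dict[str, Any],
        label_provider: Callable[["Entity", str], str],
    ) -> None:
        self.original_langcode = original_langcode
        self.values = values
        self.label_provider = label_provider

    def label(self, langcode: Optional[str] = None) -> str:
        """Return the entity label as a plain string."""
        return self.label_provider(self, langcode or self.original_langcode)


def property_label(name: str) -> Callable[[Entity, str], str]:
    """Provider backed by a translatable property, falling back to the original."""
    def provider(entity: Entity, langcode: str) -> str:
        translations = entity.values[name]
        return str(translations.get(langcode, translations[entity.original_langcode]))
    return provider


# Illustrative formats only; real localization would come from the locale system.
DATE_FORMATS = {"en": "%m/%d/%Y", "it": "%d/%m/%Y"}


def event_date_label(entity: Entity, langcode: str) -> str:
    """Computed provider: one shared date, localized on the fly, nothing stored."""
    return entity.values["date"].strftime(DATE_FORMATS.get(langcode, "%Y-%m-%d"))


node = Entity("en", {"title": {"en": "Hello", "it": "Ciao"}}, property_label("title"))
event = Entity("en", {"date": date(2011, 12, 21)}, event_date_label)
```

Neither entity needs a dedicated label storage column: the node reuses its title property and the event computes its label from the shared date.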

The entity data structure should also hold the translation metadata, such as publication status, author, creation date and so on. In the Entity translation module these are stored as standalone records that are injected into the entity data structure at load time as an array of standard objects. It might be worth investigating the possibility of making them (non-translatable) entities. This would allow us to exploit a hypothetical generalized entity access API to implement translation workflows that need to grant translators per-language access.

(Default) Storage

One of the most important aspects to be covered to add native ML support to entities is coming up with a working storage model for translated content, at least for our default SQL storage layer. In D7 we have partially solved this requirement by adding native language support to the Field API, where per-field storage made adding ML support almost trivial. However, we still need to take care of properties, which have no ML capabilities in D7. The Entity translation module provides an {entity_translation} table which partially replicates the {node} one, providing a way to store per-language values for the most relevant node properties (status, author, creation date, ...), which at the moment seems enough to cover the major use cases for any entity type. OTOH this is once again a suboptimal solution, since it is definitely not flexible enough to handle arbitrarily complex use cases.

Some proposals have emerged from an earlier discussion on this subject:

Entity id + language as primary key: in this proposal the {node} table (but this concept works for any entity type) also gets the language column as part of the primary key; this way all node language variants could be stored in the same table and every property could have a language variant natively. The advantages of this approach include consistency with the default Field API storage and somehow forcing most code dealing with entities to confront ML support requirements. The downsides include more complex logic when joining on the entity tables and the introduction of a one-off solution for language, when we might want to extend entity variant support to an arbitrary number of axes (see "Variants"). Moreover the consequences of such a pattern shift are obscenely far reaching [Crell] and might not be well understood at the moment.
Translatable schema: the main alternative discussed for now is having parallel {entity_type_translation} tables holding the language variants for each entity. These would be created only when enabling entity ML support by inspecting the entity schema definition, in which ML properties would be marked as such. This approach could exploit the translatable query tags, introduced in http://drupal.org/node/593746 with a slightly different use case in mind, to alter queries in a transparent way. The major advantage of this approach would be that sites not needing ML support would get no overhead, while sites needing ML support would get a full-fledged solution, provided that we manage to design a system that is able to handle most (ideally all) ML storage use cases automatically. The major drawback here is that we might leave something uncovered (consciously or not). Parallel tables could also be implemented as materialized views, but this approach would probably be unnecessarily complex. The translatable schema solution was initially proposed in http://drupal.org/node/367603.
Translation set sharing: a third solution, which did not get much attention but is probably still on the table, is dropping field translation altogether and keeping translation sets: translatable properties/fields would be attached to the entities, while shared fields would be attached to translation sets. This solution would not fully solve the need to share properties, but in turn would require almost no change to our current storage model. The main drawback is that we would still be dealing with multiple entities, that is, per-language ids, which in many situations are really problematic and were one of the main reasons why field translation was introduced.
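
The first two proposals can be compared with a small runnable sketch using Python's built-in sqlite3 module. The table and column names are simplified stand-ins, not the actual Drupal schema, and the COALESCE-based fallback query is just one assumed way a translatable query could be rewritten.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Proposal 1: entity id + language as primary key, one table for all variants.
    CREATE TABLE node_v1 (
        nid INTEGER NOT NULL,
        langcode TEXT NOT NULL,
        title TEXT NOT NULL,
        status INTEGER NOT NULL,
        PRIMARY KEY (nid, langcode)
    );
    INSERT INTO node_v1 VALUES (1, 'en', 'Hello', 1), (1, 'it', 'Ciao', 1);

    -- Proposal 2: translatable schema, base table plus a parallel translation
    -- table that exists only when ML support is enabled.
    CREATE TABLE node (
        nid INTEGER PRIMARY KEY,
        langcode TEXT NOT NULL,
        title TEXT NOT NULL,
        status INTEGER NOT NULL
    );
    CREATE TABLE node_translation (
        nid INTEGER NOT NULL REFERENCES node (nid),
        langcode TEXT NOT NULL,
        title TEXT NOT NULL,
        status INTEGER NOT NULL,
        PRIMARY KEY (nid, langcode)
    );
    INSERT INTO node VALUES (1, 'en', 'Hello', 1);
    INSERT INTO node_translation VALUES (1, 'it', 'Ciao', 1);
    """
)


def load_title_v1(conn, nid, langcode):
    """Proposal 1: every query on the entity table must filter on language."""
    row = conn.execute(
        "SELECT title FROM node_v1 WHERE nid = ? AND langcode = ?", (nid, langcode)
    ).fetchone()
    return row[0] if row else None


def load_title_v2(conn, nid, langcode):
    """Proposal 2: join the parallel table, fall back to the base table value."""
    row = conn.execute(
        """
        SELECT COALESCE(t.title, n.title)
        FROM node n
        LEFT JOIN node_translation t ON t.nid = n.nid AND t.langcode = ?
        WHERE n.nid = ?
        """,
        (langcode, nid),
    ).fetchone()
    return row[0] if row else None
```

In the first layout a node id alone no longer identifies a row, which is where the extra join logic comes from; in the second, language-unaware queries against the base table keep working unchanged and only ML-aware queries touch the parallel table.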

Whichever solution we come up with, we should ensure we have a clean separation between API and storage: while designing translatable fields, we agreed that a loaded entity object should hold all the available language variants. Having all data in place helps dealing with ML workflows without worrying about the actual storage and makes caching far easier. Pairing this approach with a storage-agnostic query builder like EFQ should allow us to write code that is truly storage-independent and this is a benefit that has a much broader scope than just ML support.

Place to discuss this: we might want to hijack http://drupal.org/node/367603 or open a new issue. To be defined.

Background links

Translatable fields UI
Revert node titles as fields
Convert node titles to fields
Field language issue
Original translatable fields issue

Comments

Current status

Posted by kristen pol on February 2, 2012 at 5:38am

For a current status of the project, check out Gábor Hojtsy's Drupal 8 Multilingual Initiative "rocketship" page.

Kristen

Contact: https://www.hook42.com/contact
Drupal 7 Multilingual Sites: http://www.kristen.org/book
