Drupal 5.x and earlier have a "locale" concept, which lets a user see the interface of the website in a chosen language. The list of languages available for locale is defined by the administrator, and a user can choose a preferred language on the user profile editing page. Only the built-in interface text can be translated, though; none of the admin-specified text (i.e. site name, site slogan, menus, categories, blocks, etc.) is translatable. It is also not possible to translate content sent in by different users of the website.
Some solutions evolved over the years to overcome this limitation, making it possible to translate user-defined interface and content elements. Unfortunately, these modules can only build on what is available in Drupal, and are not as closely integrated as some performance requirements or design considerations would require. Since the Drupal core code is not ready to handle this situation, most contrib modules don't care about offering this feature, and internationalization (i18n) modules need to work around them too.
In getting ready to add better internationalization support to Drupal 6, we examined the code base of the available modules (although these are actively changing), identified the problems, and looked at the possible outcome of some development around custom node types, menus, etc. The summary below is a list of items we worked on last week in Budapest, and the reasoning behind some implementation details.
Try the code! Suggest better implementations!
You can try the code by grabbing our (temporary) Drupal fork from the Development Seed sponsored SVN repository. As we have committed the Drupal CVS metafiles too, you can do a cvs diff on the codebase to see what we changed from Drupal 6-dev. And because what you get from the SVN repository is a complete Drupal core, you can install it and try it out. We have not yet got around to providing upgrade paths, so do not try this code on a previously used Drupal database. Do a fresh install!
Language management
In Drupal 5.x the language list is managed by the locale module, and the attributes of languages are restricted to those needed by that module. I18n related contrib modules reuse this list of languages, some adding more properties. We found that Drupal core itself would need more properties for each language, so that themes can be made to support different language behaviour. After discussing this with Dries Buytaert, we decided to keep the language management functionality in the locale module, but it is now on a separate primary task.
What changed?
- "locales_meta" table becomes "languages"
- The "locale" key name becomes "language" (also in "locales_source" and "locales_target" tables)
- The "languages" table gets a new "direction" field, so we can store "Left to right" (LTR) and "Right to left" (RTL) directionality information, which is even needed by core themes to adapt to RTL languages. The built in language list gets new direction values, so automatically added languages have this value properly set. [TODO]
- The "languages" table gets a new "native" field, which stores the native name of the language to show on language switchers. This is important for those who visit a foreign language page, since we can help them spot their language by showing the native name (too). The built in language list already has a list of native language names.
- The "languages" table gets a "weight" field, so languages can be weighted. This could be used for language negotiation (it is not used for this yet), and for listing languages by weight for language selection.
- The "languages" table gets a "prefix" column, where a path prefix can be specified for that language.
- The "languages" table gets a "domain" column, where a language domain can be specified for that language (eg. de.example.com or example.de).
- The "languages" table loses the "isdefault" column, in favor of storing the serialized default language object in the variables table. More below.
- The language management screen is now a standalone primary task, with listing and addition as secondary tasks. The locale import/export functionality is a secondary task of the locale functionality.
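To make the new columns concrete, here is a minimal sketch of a language record, written in Python rather than Drupal's PHP. The field names follow the list above; the Language class, the LTR/RTL constants and switcher_labels() are hypothetical names made up for this illustration and are not part of the actual patch.

```python
from dataclasses import dataclass

LTR, RTL = 0, 1  # assumed encoding of the "direction" field


@dataclass
class Language:
    code: str             # the "language" key, e.g. "de"
    name: str             # English name of the language
    native: str           # native name, shown on language switchers
    direction: int = LTR  # LTR or RTL
    weight: int = 0       # lighter languages are listed first
    prefix: str = ""      # path prefix, e.g. "de"
    domain: str = ""      # language domain, e.g. "example.de"


def switcher_labels(languages):
    """Return native names ordered by weight, then by English name."""
    ordered = sorted(languages, key=lambda lang: (lang.weight, lang.name))
    return [lang.native for lang in ordered]


languages = [
    Language("he", "Hebrew", "עברית", direction=RTL, weight=1, prefix="he"),
    Language("en", "English", "English"),
    Language("de", "German", "Deutsch", weight=1, prefix="de", domain="example.de"),
]
print(switcher_labels(languages))
```

Sorting on the (weight, name) pair is only one possible tie-breaking rule; the real listing order is up to the implementation.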
Performance
The language list is needed by locale module and is also needed by any extending functionality to handle languages on Drupal objects (nodes, taxonomy, blocks, etc). The language list is not needed on a site where there is only one language used (English by default, but can be set to a different language). For this reason, we introduced two variables: "language_count" and "language_default". We always store the number of languages set up into "language_count", and if it is only one, we consider "language_default" to be that one, and use that exclusively. This means that we can completely sidestep language detection and other language management code when serving pages on a single language website, as these two values are directly loaded with other variables anyway.
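The shortcut can be sketched roughly as follows. This is a Python illustration, not the actual Drupal code: active_language() and the negotiate callback are hypothetical, and the default language is represented by a plain language code here, whereas the real "language_default" variable holds a whole language object.

```python
def active_language(variables, negotiate):
    """Pick the page language, skipping negotiation on single-language sites.

    variables: dict of already loaded site variables.
    negotiate: callable returning a language code or None; it stands in for
               the (comparatively expensive) language detection code.
    """
    default = variables.get("language_default", "en")
    if variables.get("language_count", 1) <= 1:
        # Single-language site: no language list, no detection code needed.
        return default
    return negotiate() or default


single = {"language_count": 1, "language_default": "hu"}
print(active_language(single, lambda: "de"))
```

The point of the design is that both values arrive with the other variables, so the single-language case costs no extra queries.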
Impact, future usage
The user sees one single spot for setting up languages. Since languages have their own forms, any extending functionality can hook into them (see hook_form_alter()) and add more settings to languages. Having the weight and direction information, themes can adapt to finer language detail, like adding the appropriate HTML attributes when handling an RTL language.
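As a small illustration of what a theme could do with the direction information, the Python sketch below builds the attributes for the page's html element. The html_attributes() helper is hypothetical; only the standard HTML lang and dir attributes are assumed.

```python
def html_attributes(code, rtl):
    """Build lang/dir attributes for the <html> tag from language properties."""
    direction = "rtl" if rtl else "ltr"
    return f'lang="{code}" dir="{direction}"'


print("<html " + html_attributes("he", True) + ">")
```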
Language negotiation
Once a site has more than one language to show interface and/or content in, the language or languages to use need to be chosen somehow. We have decided to go with a single language selection for now, with fallback to the default language if possible, but there are valid use cases for more complicated sites where a list of possible languages (e.g. for content filtering) is desired instead of a single language.
What changed?
Because the decision on the language used has an impact on the variables being loaded, the cache being checked and so on, it was obvious that a bootstrap step should be added to check for the language and decide on the language used. We examined the existing code used for page caching and found that it is ready for dealing with domain and path prefix changes, so in order not to hurt the performance of pages served from cache, we added language negotiation as a new phase right after the "late page cache" bootstrap phase.
The $locale global variable became $language and is now not only a language code, but a full language object, which contains information about the native name, the direction and other attributes of the language. This is directly passed to phptemplate themes too.
We added language.inc to get loaded when multiple languages need to be handled. Language negotiation (the selection of the displayed language) and path handling are now implemented there.
How does it work?
- If there is only one language set up, the default language is always used.
- If there are multiple languages set up, the admin can choose one negotiation scheme:
  - No selection. This always chooses the default language, and is the default scheme. It is handy to keep previous practice (i.e. you don't suddenly change your URLs when you start to set up languages).
  - Domain based. If a language specific domain is found (arbitrary domains can be set per language), that language is used, otherwise the default. This scheme could have paths like http://example.de/node/1 and http://example.co.uk/node/12
  - Path based with empty default. All languages use the same domain, but different path prefixes. The default language has no prefix, so if a path without language prefix is identified, the default language is selected, otherwise the requested language. This scheme could have paths like http://example.com/node/1 (default language) and http://example.com/de/node/14 (German).
  - Path based with default as redirector. All languages use the same domain, but different path prefixes. Every language should have a prefix. If a path without a language prefix is found (e.g. http://example.com/) a negotiation algorithm runs. It tries to check for the user language (set on the user profile page), then falls back to the browser preferences, then falls back to the default language if others fail.
Other schemes might get implemented, or we might add an extension point for other modules to implement additional fallbacks like IP-to-country based language detection. [TODO]
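The four schemes can be sketched in a few lines. What follows is a simplified Python illustration under stated assumptions, not the code in language.inc: the scheme constants, the negotiate() signature and the dictionary layout of the language list are all invented for this example.

```python
from urllib.parse import urlsplit

# Hypothetical identifiers for the four negotiation schemes.
NONE, DOMAIN, PATH_DEFAULT, PATH_FALLBACK = range(4)


def negotiate(url, scheme, languages, default, user_lang=None, browser_langs=()):
    """Return the language code to use for a request.

    languages: dict mapping a language code to {"prefix": ..., "domain": ...}.
    """
    parts = urlsplit(url)
    if scheme == NONE:
        return default
    if scheme == DOMAIN:
        for code, lang in languages.items():
            if lang.get("domain") and lang["domain"] == parts.hostname:
                return code
        return default
    # Both path based schemes inspect the first path segment.
    first = parts.path.strip("/").split("/")[0]
    for code, lang in languages.items():
        if lang.get("prefix") and lang["prefix"] == first:
            return code
    if scheme == PATH_DEFAULT:
        return default
    # PATH_FALLBACK: user setting, then browser preferences, then default.
    for candidate in (user_lang, *browser_langs):
        if candidate in languages:
            return candidate
    return default


languages = {
    "en": {"prefix": "en", "domain": "example.co.uk"},
    "de": {"prefix": "de", "domain": "example.de"},
}
print(negotiate("http://example.com/de/node/14", PATH_DEFAULT, languages, "en"))
```

The sketch only returns a language code; stripping the prefix from the path and issuing the redirect in the last scheme are left out.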
Path aliases
Once we have the available languages and a language chosen, we can provide language based path aliases, if available. This is important to find the right menu callback to fire, so this takes place right in bootstrap.
What changed?
drupal_get_normal_path(), drupal_get_path_alias() and drupal_lookup_path() all got a new optional $lang parameter, which is a language code to look up the path for. This way drupal_init_path() inits the path with a language specific alias, if available. To store language specific aliases, we introduced a "language" column in the "url_alias" table, and a language selection box on the alias editing/addition page.
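A language-aware lookup might behave like the Python sketch below. It is an illustration only: get_normal_path() merely mirrors the idea of drupal_get_normal_path(), the rows stand in for the "url_alias" table, and the convention that an empty language string marks a language-neutral alias is an assumption of this sketch.

```python
def get_normal_path(aliases, path, lang=None):
    """Resolve an alias to its system path, preferring a language specific row.

    aliases: iterable of (source, alias, language) rows; language "" is
             treated as language-neutral here (an assumption).
    Returns the path unchanged when no alias matches.
    """
    neutral = None
    for source, alias, language in aliases:
        if alias != path:
            continue
        if lang and language == lang:
            return source  # a language specific alias wins immediately
        if language == "":
            neutral = source
    return neutral if neutral is not None else path


aliases = [
    ("node/1", "about", ""),
    ("node/7", "about", "de"),
    ("node/9", "kontakt", "de"),
]
print(get_normal_path(aliases, "about", "de"))
```

In a database-backed version the same preference could be expressed in a single query, which is where the performance question discussed next comes in.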
Performance
Our current implementation has a (possibly small, not benchmarked yet) performance impact on users without language specific aliases. Ideas are welcome to improve the situation.
Variables (i.e. site settings)
The inherent problem with variables is that modules provide all kinds of interfaces to set them. Some variables are computed, some are not available on the user interface to set, and some are set with exotic interfaces (with possible interdependencies, etc). All we have been able to agree on was that we need to provide language dependent site settings, and if we do it in Drupal core, we can better cache what is expected to be required by the page, so we get better performance for served pages. How the user interface will look is not finalized yet. [TODO]
What changed?
variable_init(), variable_get(), variable_set() and variable_del() got a new language parameter, which is a language code to work with. We introduced per-language variable caching, and we added code to be able to load variables for any desired language, and then perform a fallback on the default, if the requested variable is not available.
This part of the code still needs some polishing, but the general idea is there. [TODO]
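The fallback idea can be sketched as follows. This is a Python illustration with a hypothetical signature, not the patched variable_get(): the nested dictionary stands in for the per-language variable cache.

```python
def variable_get(store, name, default=None, lang=None, default_lang="en"):
    """Look up a variable for a language, falling back to the default language.

    store: dict mapping a language code to a dict of variables for it.
    """
    for code in (lang, default_lang):
        if code is not None and name in store.get(code, {}):
            return store[code][name]
    return default


store = {
    "en": {"site_name": "My site", "site_slogan": "Hello"},
    "de": {"site_name": "Meine Seite"},
}
print(variable_get(store, "site_name", lang="de"))
print(variable_get(store, "site_slogan", lang="de"))
```

In this sketch a variable missing for the requested language silently takes the default language value, which matches the fallback described above but says nothing yet about how the user interface would expose it.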
Nodes
We discussed nodes at great length, especially in the light of some user defined field code possibly getting into Drupal core.
Problems with "inline" node translation
A few people noted before that we should support node field level translation, and should not store translations of nodes in different node instances. For that to work, basically all node properties should be a field (ie capable of being replaced by a different value, when translated). That means that from the perspective of i18n we should make taxonomy terms, uploaded files, author ids, submitted dates, titles, bodies, node level properties like published, promoted, etc. "replaceable fields" of the node. So when a translation comes along, we can associate different pieces to the translated node.
What would remain constant in all cases, is the node id. In case we would like to have this perform well, we would need a powerful query rewriting mechanism, where we can request an author id, published flag, taxonomy term association, etc. with a language condition for a node id, so that unused data will not get queried, and we don't need to have multiple queries for a node to grab different fields from different tables.
Even then the node id would remain, which is what node level permissions connect to (in addition to taxonomy terms, uploads, etc.), so permissions should also be made aware that a node id is not a unique content id anymore.
It seems that if we would like to implement this with good performance in mind, we would need to overhaul the node concept and tack language into every aspect of it. This would probably not get a warm welcome from those who are not interested in multilanguage sites.
If you have better ideas here, speak up!
Translation sets to the rescue?
Because with languages, we are basically adding another layer on top of nodes anyway, why not do the implementation from this perspective? With earlier implementations, the problem was that 'nid-language-nid' relations were stored. However, if we introduce the concept of "translation sets", we can group translations of the same content into one set, and give that set an ID. Either this set is itself a node, and contains the common properties of all items in the set, which are projected to the nodes when being edited/displayed/queried, or we need to be able to mark shared fields in the set member nodes themselves. Either way, we would probably need to copy the values over to the set member nodes when being saved, so we don't need to hit the database with the node mashup algorithm on every query. This is why it is probably better to not have the set as a node, since that node would be a rather thin one in most cases, actually making it unusable.
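One way to picture a translation set that is not itself a node is a shared ID stored on every member node. The Python sketch below assumes exactly that; the Node class, the tsid field and find_translation() are hypothetical names, since no storage format has been decided.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    nid: int
    language: str
    title: str
    tsid: Optional[int] = None  # hypothetical translation set ID; None = no set


def find_translation(nodes, nid, lang):
    """Return the node in the same translation set as nid with language lang."""
    by_nid = {node.nid: node for node in nodes}
    tsid = by_nid[nid].tsid
    if tsid is None:
        return None
    for node in nodes:
        if node.tsid == tsid and node.language == lang:
            return node
    return None


nodes = [
    Node(1, "en", "About us", tsid=10),
    Node(14, "de", "Über uns", tsid=10),
    Node(15, "hu", "Rólunk", tsid=10),
    Node(20, "en", "News"),
]
print(find_translation(nodes, 14, "hu"))
```

With a shared set ID, each translation needs only one extra value, whereas relating every pair of nodes in a set of n translations directly would take n(n-1)/2 relations.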
Menus, taxonomy, blocks, profiles
These are still to be looked at more closely implementation-wise. Blocks and profiles are expected to be simpler, since we do not intend to provide capabilities for users to translate their profiles to different languages, just administrators to translate the field names and descriptions to different languages. Blocks are very similar.
Menus and taxonomy are a whole different question. Some people like to see different menu and taxonomy trees for different languages. On a website, where only a few pages are translated, the translated versions will have a lot less menu items and a definitely smaller taxonomy term list to work with, since listing hundreds of empty pages for taxonomy terms nonexistent for that language is simply not an option.
For those sites, where all taxonomy terms have equivalents in other languages, a different UI option can be presented so that it is easier to edit taxonomy terms and translations.
Expected future work we intend to build on
Some of the future work possibly in the Drupal 6 timeframe which affects our work:
- Menu system. The user defined menu item system is under rework, so we are awaiting what we can work with.
- SQL query builder. There is word on the street that a better (not preg dependent) SQL query builder is on the way. We look forward to using that for query rewrites.
- User defined fields. This should open up a whole new can of worms, so we are very interested.

Comments
Gabor, thank you for the writeup.
http://www.twitter.com/lxbarth
Great Work
Nagyon jo, Very good, C'est beaux.
Glad to see some of my favourite topics covered (url/prefix based switching) - but the really tough questions still remain - where/how is that node translation stored.
I think this work is truly exciting and cannot wait to see the outcome. Sadly at this time I can only offer my encouragement.
andre
mindshift
Our task first really is to help people shift their minds, to wrap their head around the i18n problem in every aspect of Drupal. Once better language support is there in Drupal, contrib modules will (hopefully) need to work with it. Of course more features are important but we need to do the base things right to build on.
Open issues Feedback
Thanks for the good work. I love to see RTL issues resolved.
Consider this: if the values for the primary language are stored (or at least cached) in the previous data structure, we won't have any performance penalty. Hope this helps.
Consider using database views according to the current language - it might help with performance.
So we will access the 'node' view instead of the 'node' table and the query might stay the same. Hope this idea helps.
Custom admin-defined blocks should still be translatable.
Amnon
Drupal Israel / Drupal Hebrew Translation
Multilingual themes - Automatically load RTL CSS files
I've just finished writing the requirements for automatically loading RTL CSS files, and opened an issue for that. For all themes, those files should automatically be loaded by Drupal's theme engine if the language of the current page is RTL.
Are there any other multilingual theme requirements which should be treated in a similar way?
"Default" language and content display of more than one language
Thank you for your write up and for the work you are doing.
I hope this is not an intrusion as I am not a programmer, but the statement on Language negotiation was interesting to me from an end-user (site visitor and webmaster) point of view.
I think that the "list of possible languages" is a fairly unique feature. Use cases may be few because this capability is pretty rare, but I can certainly think of some. For example, a community site for a city in Canada where many of the visitors would be able to read/write both French and English may contribute their content in either (but probably only one) language--and many of the visitors will want to see all content regardless of language, while others will only want to see the content in their preferred language. Having to switch between French and English to see all the posts (I assume there would be a language switching block for manual language selection) would be counter-intuitive, but check-boxes for both French and English should be pretty easy for most people to use.
Of course in this case, which language is the "default" is another issue. Does "default" mean the default language of the site as specified by the webmaster? Or, does it mean "if the content is only in one of the available languages, show that one", in which case "default" would change with content? The issue gets more complicated as more languages are added.
Asking for this feature may be too much for a first shot at Drupal core, and so maybe it is best left to contributed modules, but I think this use case has some real potential and so if it doesn't make it into core I hope that it is not "precluded" by anything in the new core development.
list of languages
List of languages could be made useful for sure, we will see whether it is possible to come up with something in Drupal core to support this or it will be left for contrib.
Defaults are site defaults as specified by the administrator for the first shot. It might be possible to have per-content defaults, but that would probably greatly complicate pages where multiple pieces of content are displayed. It is still something to keep in mind, I think, because there is a real practical problem there.
default_language != fallback_language
From my experience, the default language should not have to be the same as the fallback language. It can be, but that should not be mandatory. I'd like the i18n system to have a separate fallback language.
________________
Claudiu Cristea
webikon.com
Examples?
Could you please give at least one example?
For example, we manage a site which is intended to address mainly a non-English public. Let's say Italian. So, when a user requests www.example.com, he will get the Italian version of the page. This (Italian) will be the default language.
Because we are using Drupal, which is built in English, we will use English as the fallback language. This makes sense if we assume that the Italian translation of the interface can be incomplete, or that new untranslated modules can be added to the system.
________________
Claudiu Cristea
webikon.com
Drupalese
I'm just wondering about this very basic question: why is Drupal built in English? Why the preference for one specific human language? Wouldn't it be logical to consider the strings coming out of the engine as "Drupalese" and always localize (to English, for example)? It would cause some performance overhead for English-only sites, but wouldn't many of the other issues like default and fallback, etc. clear up?
The current system also demands that module developers write strings in English, even if the developer speaks little English or the module is something language-specific and practically can't be translated (e.g. Chinese bookkeeping or something like that).
local modules
Well, there is no requirement to write modules local to some language in English. We have functionality (which is turning into modules) for Hungarian free hosting providers. There is no reason to do the interface in English there.
But if internationally usable modules are not written in English, but in mixed languages, "Drupalese" would be a mix of languages, and translations would end up being from Chinese to Italian, Spanish to English and so on. Whether it is fortunate or not, English is the base language known by most of the members of the international community.
Drupalese
In fact, the interface is in English, from Drupal's point of view, since it makes no distinction between Drupalese (strings in the .module file in whatever language) and the English language used by actual people. On a trilingual English-German-Hungarian site it will serve up all that Hun gibberish to non-Hungarian visitors (even though the whole module makes no sense outside of the Hungarian context).
This may seem an exotic problem until you start to develop multilingual business applications on Drupal where language-specific functionality is the norm not the exception. I may have missed something in your article but it seems to me that we'll have to implement language detection in each of those custom modules, since Drupal won't provide a central interface to switch certain modules off when a page is served to an uninterested visitor.
You might be quite right, but the problem is that the "international community" is a minority, even among internet users. The bigger part of the world uses languages other than English as "bridges". Consider the role played by Spanish (Latin America), French (Central Africa), Malay (South Asia), Arabic (Islamic world) or Russian (former Eastern Bloc). From Záhony to Vladivostok you'll want to fall back on Russian, not English. In Kazakhstan you'll want to enable Kazakh and Russian (and maybe half a dozen other languages), use Kazakh as default and fall back on Russian if Kazakh is missing. (Perhaps this can be managed with the language weights you mentioned?)
English is not an absolute default
Hmmm, language specific functionality... Edith, do you imagine that turning off complete modules per language is overwhelmingly more common than turning off features (more granular than modules) per language?
The fact that common Drupal modules are written in English does not mean that English will be the 'fallback' in Central Africa or in Vladivostok.
user-friendliness
Truly, I've no idea. But I can easily imagine companies that operate in several countries and their activities are somewhat different in each country. Let's say, you are in the HR business. In Austria you do executive search and workforce leasing and in Hungary only workforce leasing. You develop two custom Drupal modules to handle the two types of your business activities, but currently there's no central mechanism to turn off the executive search module on the Hungarian version of your website (and you'll be flooded with resumés from desperate Hungarian job-seekers, even though you only want to accept applications from Austrians).
I recognize that this isn't an issue for most multilingual sites, only for those that function as web applications. But right now I can only solve this issue by setting up separate sites, which kind of defeats the whole concept of i18n and multiplies site maintenance costs for the client.
Understood. But this isn't very user-friendly. Let's say, you live in the Philippines and you are one of the 17 million native Tagalog speakers. You also speak the official languages of your country: Filipino and Spanish. (That's 3 languages altogether, no small feat!) You set up a Drupal website for your community. Quite a few modules have no Tagalog translation, however practically all of them are available in Spanish. But it's all no good, because when Tagalog is missing, the interface will fall back to English. So you end up importing Spanish strings into your Tagalog translation, maybe into the Filipino version as well - 3 languages and 5 untranslated modules on a site and you have a pretty big maintenance problem.
I'm not sure if this issue is solvable at all, considering how Drupal handles translations. I'm just saying that for the majority of the world's population English is not a second language, or third, or fourth. Allowing them to fall back on the second language of their choice would be highly desirable.
List of languages by number of native speakers — notice how many significant languages (10 million+) are located in countries where English isn't taught at all.
everything at once
Well, we are trying to make progress in Drupal 6, but as CCK is getting into Drupal step by step, we also need to meet some realistic targets. That said, we keep these ideas in mind. By the way, there is currently no way to turn off modules on any condition (except throttling), although regional differences are there even if you don't consider language. You actually speak about regional differences here, not language differences. An Austrian speaking person living in Hungary would also be off target for this functionality, although the Austrian interface would be presented to him. Turning off modules (or some functionality of modules for that matter) on some condition would be an interesting feature, for sure.
Let me assure you that we know about this problem. As I have said, there is a line between content and interface translation. Content will only be available in the languages provided by the contributors to the site, while the interface will always be available in English, no matter what happens. There are interesting questions in this area, like what you should show when a taxonomy listing is displayed and you don't have all the interface and content in the desired language. So you say that the listing would contain some Tagalog, Filipino and Spanish nodes, while the interface would be sometimes Tagalog, sometimes Spanish and sometimes Filipino (seemingly randomly from the user's point of view)? Imagine a block with a Spanish title, Tagalog messages in it and the 'more' link being in English, given no translation in the database. How usable is this? Will people in the Philippines find this usable? Will the frequent context switching on the interface between three languages fit their expectations?
I would say that we could develop solutions for such defective setups (i.e. three differently incomplete translations for three languages), but how different is falling back to multiple languages from falling back to one standard, so that you know why it happens and you have predictable behaviour? I bet the above site would be similarly useless for non-English speakers with a partly English interface and with the randomly mixed interface language version.
Of course content is a whole different question, as I have said, we don't have a universal fallback there, and different sites need different models. The Views module will probably need to deal with the more complex situations, and Drupal core will provide one possible way of listings and language display by default.
realistic scenarios
People in Austria speak German ;)...
...though some Germans and Austrians would probably disagree with this statement :)
The 'more' link — and the entire Drupal core — is already translated to Spanish. Basically, it would be a site with Tagalog content and Tagalog-Spanish interface, where Spanish would step in where there's no Tagalog available — typically on admin pages and on the interface of contrib modules.
Yes, we can think of situations when the admin wants to install a little-used contrib module that has no Spanish translation, and in that case the interface will fall back to English. But is this a good enough reason not to let them use Spanish as primary fallback language of the core? And which task is easier - translate a little-used contrib module to Tagalog or translate Drupal core to Tagalog?
Tagalog vs Spanish
How would Spanish be completely available for all modules? Because it is a more widely used language than Tagalog? When building a local site, the requirements should set language support given the local conditions. Tagalog should not be that small a minority to dismiss, especially if the site has content in that language.
Let me repeat that the theory is interesting, but displaying more than two languages on the same page (with interface strings) is not a target functionality in priority for now as far as I see. Of course anyone can get in and suggest an implementation, the Drupal source as well as our Drupal fork is open.
Admin-defined fall-back
I'd also like to see the possibility for an admin to define a chain of fallback languages. We've got a webpage where we offer English, German, and 'joke' German, and it would be nice to use German as the fallback language for joke German, and English as the fallback language for German.
There is a simple way to support any number of fallback languages very efficiently: by preprocessing and caching the translation files on the server side. So, in the above example, we've got an English translation file (which might be autogenerated from the source), a German translation file, and a joke German translation file. Then, to create the real joke German translation file we just take the union of all three, with German having priority over English and joke German over German.
This approach might actually make the translation code simpler, because the translation code can simply assume that the translation file contains a valid translation for everything.
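The merge described here is small enough to sketch. The following Python illustration assumes each translation file has already been parsed into a dictionary from source string to translation; build_catalog() and the "de-x-joke" code are made up for the example.

```python
def build_catalog(chain, catalogs):
    """Merge translation catalogs along a fallback chain.

    chain: language codes from most preferred to least preferred.
    catalogs: dict mapping a language code to {source string: translation}.
    """
    merged = {}
    # Apply the least preferred catalog first so later ones override it.
    for code in reversed(chain):
        merged.update(catalogs.get(code, {}))
    return merged


catalogs = {
    "en": {"Save": "Save", "more": "more", "Log in": "Log in"},
    "de": {"Save": "Speichern", "more": "mehr"},
    "de-x-joke": {"Save": "Abspeicherung"},
}
joke_german = build_catalog(["de-x-joke", "de", "en"], catalogs)
print(joke_german)
```

The merged catalog would be built once and cached, so a page request only ever looks strings up in a single dictionary.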
for the interface
This happens for the interface anyway, because this kind of thinking is built into Drupal already. Why would such a concept be useful for user (and/or admin) generated content? There is no fallback content (built into Drupal) in that case.
Can we divide?
Can we divide the content and interface?
We have a problem of misunderstanding -> http://drupal.org/node/282178
spam?
The 2 new comments by oaltawel look like spam to me.