User provided vs. code provided translatables and translation sets

Posted by gábor hojtsy on May 24, 2011 at 5:46pm

In my previous post, "Drupal's multilingual problem - why t() is the wrong answer" (posted on my blog and on groups.drupal.org for feedback), I detailed the issues with using t() as a translation tool for "user provided data". This post goes into further detail and discusses current solutions, which could form the basis for discussing future ones.

How can we even tell the difference between code and user provided translatables?

It is fair to assume that many multilingual sites will not have English as their default language (many will not even have it among their supported languages), so we cannot assume that blocks, menus, and so on are entered in English. However, source code based strings are considered part of the user interface, and as such are assumed to be written in English. What does this have to do with default configurations set up by modules, and how do we reconcile it with the growing popularity of exportables and features (as in Features module generated versioned export packages)? Let's look at these two questions.

Preset configuration from distributions and module installs

When you set up a site localized with .po files under your Drupal source tree per the Drupal instructions, you'll get your default "user provided" (preset) configuration localized. Most install files use t() or st() before they insert their data into the Drupal tables. Therefore default content types, admin shortcuts, etc. are saved in the language you install Drupal in (apart from some current bugs). This is nicely in line with how Drupal assumes all your user provided data is in the site default language, and the assumption is that you'll keep building out your site in that language going forward. It becomes an issue, however, if you enable a module using an admin language that is different from the site default, and the module adds default configuration using translation functions. That default configuration will be saved in the active language and go against our assumption. To fix this, we could always pass the site default language to the translation functions in install routines. Granted, it is not always easy to tell what counts as an install routine, and API functions are called in many situations, not all of which are language aware. I think these can and should be hunted down on an individual basis.
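A minimal standalone sketch of that fix, under stated assumptions: demo_translation_table(), demo_translate() and demo_install_defaults() are hypothetical names, and the tiny translation table stands in for imported .po data. In Drupal 7 itself the nearest equivalent would presumably be passing a 'langcode' key in the third ($options) argument of t(), with the code taken from language_default().

```php
<?php
// Standalone illustration; none of these functions are Drupal API.

// Stand-in for imported translations: langcode => source => translation.
function demo_translation_table() {
  return array(
    'hu' => array('Article' => 'Cikk'),
    'es' => array('Article' => 'Artículo'),
  );
}

// Translate an English source string into an explicitly given language,
// falling back to the source string when no translation exists.
function demo_translate($string, $langcode) {
  $table = demo_translation_table();
  return isset($table[$langcode][$string]) ? $table[$langcode][$string] : $string;
}

// Install routine: always receives the site default language, never the
// language of the administrator who happens to enable the module.
function demo_install_defaults($site_default_langcode) {
  return array(
    'type' => 'article',
    'name' => demo_translate('Article', $site_default_langcode),
    // Record the language the configuration was saved in.
    'language' => $site_default_langcode,
  );
}

// Site default is Hungarian; the admin's own UI language plays no role.
$config = demo_install_defaults('hu');
print $config['name'] . "\n";
```

Storing the language next to the saved value also keeps the door open for each piece of configuration to know the language it was entered in.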

I also think that every piece of configuration, like a menu item, your site's name or a contact form category should know the language it was entered in. This definitely needs a lot of work and even Drupal 7 can be augmented in some (limited) ways to make this a property of each configuration component universally.

Exported configuration

Let's consider a more interesting question: exported views. (I use Views because you are probably familiar with it, not because it is anything special compared to other exportables.) When you run an exported view from code, it sounds like the view should have t() calls so that labels, empty text, headers, etc. are translated when displayed in different languages. This sounds like a desirable format for exported features, so they can support multilingual use. It is all code after all, so t() is best, right? In these cases, we would then mandate that the exported configuration is in English. A stretch? Maybe. Well, if it is not English, the export should definitely not include t() calls at all. Once each configuration component (such as a view) knows which language it is in, as discussed above, we can tell which way to export it.
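The export decision described here could be sketched roughly as follows. This is not how Views or Features actually export anything; demo_export_label() and the 'language' key on the object are assumptions made purely for illustration.

```php
<?php
// Hypothetical exporter: emits PHP source for a label, wrapping it in t()
// only when the configuration object is known to be in English.
function demo_export_label(array $object) {
  // var_export() produces a safely quoted PHP string literal.
  $literal = var_export($object['label'], TRUE);
  if ($object['language'] === 'en') {
    return 't(' . $literal . ')';
  }
  // Non-English text must not go through t(), which assumes English source.
  return $literal;
}

print demo_export_label(array('label' => 'Recent posts', 'language' => 'en')) . "\n";
print demo_export_label(array('label' => 'Friss bejegyzések', 'language' => 'hu')) . "\n";
```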

When the user overrides the view, however, it should always be imported into the database in the site default language (in Drupal 7 at least), and its runtime value should no longer be run through t(), because it has now become user edited/editable, and all the usual permissions and workflows for editable configuration should apply. Suddenly we have a specific language requirement: we should store with the view the language it was saved in, and make it translatable as configuration, not as user interface.

Now this is a pretty big difference. For your code based exportable, we assumed it would ideally use t() and be written in English and therefore translatable on the user interface translation screens. However, as soon as you override it, you'd go to the views UI to translate it there as configuration. How should we avoid this mess?

Well, there are a couple options. We can let source code text be written in non-English, but then the translation call should specify which source language it was written in. That still does not solve the problem of changing workflows, and all the rest of the permission, editing and other issues I've covered in my previous post.

So to be able to handle this consistently, a general rule we can deduce is that any user editable data should be treated as if it were user provided, even if it is sourced from code. Then, to provide translations (which can of course come from code as well), we can use the same user interfaces we would use if the data had been provided on the UI to start with.

Application in practice

Unfortunately there are a few pieces missing compared to the ideal scenario.

- We need to be able to tell the language of each configuration object. This is implemented for some use cases in the Internationalization module suite, but I think how it is applied needs a rethink.
- When objects are exported, the language of the object should be exported with it.
- When objects are used directly from their exports, instead of using t() for translation, they should use configuration translation APIs (currently only available from the Internationalization module) to translate their pieces. As far as I know, this is not implemented at all, because the exports would then depend on the APIs of other contributed modules. I think exports should eventually be possible both with and without these APIs, for single language use and for code-based multi-language use.
- When an object is imported into the database, a certain language version should be imported. Once we know the language of each object, there is no strict requirement to save them in the site default language, but that sounds like the ideal approach for site builders' sanity. The default editing UI for the object would show that language, so it makes most sense to use the site default language.

For the exportables, this only requires that they are uniquely identifiable for the configuration translation APIs, which needs machine names for them, instead of incremental IDs, which is already a requirement for sane exportable implementations anyway.
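To illustrate why machine names matter for the points above, here is a rough standalone sketch. It is not the Internationalization module's API; demo_config_translate() and the "type:machine_name:property" key format are assumptions made for the example. Because the key is built from the machine name, it stays the same on every site that uses the export, which an incremental ID would not.

```php
<?php
// Hypothetical configuration translation lookup, keyed by machine name.
// $store maps "type:machine_name:property" => langcode => translation.
function demo_config_translate(array $store, $type, array $object, $property, $langcode) {
  // The object's own language needs no lookup.
  if ($langcode === $object['language']) {
    return $object[$property];
  }
  $key = $type . ':' . $object['name'] . ':' . $property;
  // Fall back to the original value when no translation is stored.
  return isset($store[$key][$langcode]) ? $store[$key][$langcode] : $object[$property];
}

// An exported object that carries its own language.
$view = array(
  'name' => 'recent_posts',
  'language' => 'hu',
  'title' => 'Friss bejegyzések',
);
$store = array(
  'view:recent_posts:title' => array('es' => 'Entradas recientes'),
);

print demo_config_translate($store, 'view', $view, 'title', 'es') . "\n";
```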

Expanding beyond simple object property translation

Now most contributed modules only focus on simple object translation if they care about translation at all. However, there are three scenarios for foreign language sites which are generally considered when building the pieces, as I've enumerated before in multiple posts, most recently in my post on blocks and textgroups:

1. Being able to mark an object as in one language. With node translation this was achieved by language enabling nodes.
2. Being able to mark an object as in one language and relate it to others as being a translation set. For nodes, this is supported by Drupal core's content translation module.
3. Finally, being able to translate pieces of the object that need translation and leave the rest alone. Load the right language variant of the object dynamically as needed. In the case of nodes, this is achieved with the contributed entity_translation module (formerly translation.module).

Now, the previous post on the dangers of t() and this post have only considered the first and the third scenario so far. We discussed that each object should have its language associated as a property, looked at some difficulties in handling code based configuration vs. UI based configuration, and concluded we should treat them the same.

The second scenario, applied to generic Drupal modules, is about limiting certain configuration objects to certain languages and organizing them into sets. Think about having a menu tree in one language and another tree in another, and when you switch languages, the menu trees should switch too. You need to have sets for your primary menus, secondary menus, etc. This does apply to a diverse set of objects, but clearly not to all types. Having a content type and a different content type for other languages sounds a bit far-fetched.

For Drupal 7, this is implemented on a one-by-one basis for some core objects by the Internationalization module, and as said, most other contributed modules are simply unaware of the possibility. I don't know how we could build a generic API for that scenario given the diversity of Drupal 7 data structures. With the Drupal 8 Configuration Management Initiative in full swing though, it looks like the current proposal is to redo all user configuration pieces under a common API, which could make it possible for us to do object property translation as well as object sets as translations in a universal way. I've asked translation related questions on the proposal, to be considered as it is being worked out. You can help there too by validating the approach against your translation needs.

While I promised to do a run-down of the current i18n_string approach to object property translation, I think there is plenty to discuss here, so I'll post about that next instead. What do you think? Please share your opinion in the comments.

Comments

Translation UI

Posted by drifter on May 25, 2011 at 7:58am

Great summary. While not directly related to this post, this inspired me to think of a possible approach to the UI. Usually, if something is not translated, or is mistranslated, you'll notice it on the frontend. The localization client module does a good job of translating t() strings from the frontend.

Perhaps we could do something similar for user defined strings. See the linked image:

https://skitch.com/drifteaur/fb8aa/mdac.info

A dashboard shortcut could activate "string translation mode" for a page. Any strings passed through tt() (I'm using i18nstrings as an example, could be something else for Drupal 8) while in string translation mode would get translation links added, which could show up similar to how contextual links are shown in Drupal 7. Perhaps different colors indicating translation status (translated/untranslated).

Clicking on a translation link could open a dialog allowing translation. My example shows multiple languages, maybe a better option would be to just translate the current site language.

Anyhow, just an idea to get things going.

contextual links

Posted by gábor hojtsy on May 25, 2011 at 8:23am

Well, we can definitely add translation links to contextual links (we already did so for certain objects: http://drupal.org/node/1114602#comment-4458610).

Putting concrete translation links on text items is pretty hard. When we generate our output from a translation function, we don't know if it is going to be check_plain()-d, or put into a title="" attribute of a link or an alt="" of an image, etc. It is very easy to break the resulting HTML if we try to generate extra spans or divs around the translation. This has been discussed for the past two years in http://drupal.org/node/218021, feel free to follow up there if you have concrete implementation ideas to make this work.

First I'd like to dare one

Posted by jose reyero on May 28, 2011 at 9:41pm

First, I'd like to challenge one more assumption about languages and Drupal code: that strings in source code should always be written in English.

Say I am building a website with a few custom modules for my Spanish client. As sometimes we don't even need English we are not going through the 'write string in English / translate / update string / retranslate / once again...' workflow. What I'd be doing is writing the UI strings in the source code straight in Spanish saving my client's time and money by doing so.

The result? One module that is not contributable at all because it has hardcoded strings in Spanish. Then, we all lose, because I cannot tell my client that they'll gain anything from taking on the work of fixing strings and contributing the module without some extra translation work. And we all know how the translation workflow works: translate / edit / retranslate, etc.

How to fix this? We basically have two options: either making it possible to add an explicit language to my source code strings, overriding the 'only English' assumption, or taking strings out of the source code completely.

Option one would go like:

<?php
print t('Hola mundo', array('source' => 'es'));
?>

Option two would be like

<?php
print string('mymodule_hello_world');
?>

And then having a separate file, say 'strings.es' where 'mymodule_hello_world' => '¡Hola mundo!' (This is the approach taken by other languages like Java so we are not inventing anything new here)

Both options, btw, would fix our issue with exportable strings. Everything in code would be ok as long as a) it is in English or b) it has an explicit language.
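For what it's worth, option two could look something like the standalone sketch below. Everything in it is hypothetical: demo_string() stands in for the proposed string(), the arrays in demo_string_catalog() stand in for files such as 'strings.es', and the English entry is made up for the example.

```php
<?php
// Hypothetical per-language catalogs, standing in for 'strings.es' etc.
function demo_string_catalog($langcode) {
  $catalogs = array(
    'es' => array('mymodule_hello_world' => '¡Hola mundo!'),
    'en' => array('mymodule_hello_world' => 'Hello world!'),
  );
  return isset($catalogs[$langcode]) ? $catalogs[$langcode] : array();
}

// Resolve a string identifier for a language. When the identifier is
// missing, return it unchanged so the gap is visible in the UI.
function demo_string($id, $langcode) {
  $catalog = demo_string_catalog($langcode);
  return isset($catalog[$id]) ? $catalog[$id] : $id;
}

print demo_string('mymodule_hello_world', 'es') . "\n";
```

No language is privileged here: Spanish and English are both just catalogs, which is the point of taking strings out of the code.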

removing strings

Posted by gábor hojtsy on May 29, 2011 at 8:58am

For contributability to the community, I think removing strings from source code is best for this use case, because it puts all languages on the same level. If you include Spanish defaults in the source code, then the code will not be very readable for English speakers. However, I think we assumed "the community" would have strong feelings against using string identifiers because (a) there are lots of strings in a module and (b) it reduces module usability anyway. I'm not sure these historic assumptions still hold (we should ask "the community" about them IMHO), and the goal of better multilingual support should hopefully help people get behind a new solution :)

Agree about removing strings

Posted by jose reyero on May 29, 2011 at 7:47pm

Agree about removing strings from source, which could also make everything more consistent, especially if we are considering this: http://groups.drupal.org/node/149984#comment-507499

Also I don't think people would be very happy with modules in mixed languages all around.

Anyway, just in case, for short strings, a string id may very well be the same as the English string, as long as we add some 'module' parameter.

t('mymodule', 'create story')

This would be good enough as long as changes to these short strings are also part of update scripts, for which we should provide some kind of update_source_string('mymodule', $old, $new) function.
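A rough standalone sketch of what such an update helper might do, assuming translations are stored per module and keyed by the source string. The storage layout and demo_update_source_string() are made up for illustration; the point is only that existing translations move to the new source string instead of being lost.

```php
<?php
// Hypothetical helper. $strings maps
// module => source string => langcode => translation.
// Returns the updated store instead of writing to a database.
function demo_update_source_string(array $strings, $module, $old, $new) {
  if ($old !== $new && isset($strings[$module][$old])) {
    // Carry the existing translations over to the new source string.
    $strings[$module][$new] = $strings[$module][$old];
    unset($strings[$module][$old]);
  }
  return $strings;
}

$strings = array(
  'mymodule' => array(
    'create story' => array('es' => 'crear historia'),
  ),
);
// As it might be called from an update script after a wording fix.
$strings = demo_update_source_string($strings, 'mymodule', 'create story', 'Create story');
print $strings['mymodule']['Create story']['es'] . "\n";
```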

This could be compatible with writing modules for other languages, as generally you don't even want to stick long texts into the code. So a module with readable English string ids, even if it's missing the 'real' English strings, is a first step towards making it contributable. Someone just needs to add the English strings in a different file and your original module will still be working.

As a side effect, this would fix two other different issues:
- String overrides, which is still an ugly hack. Strings would be overridable for English too from the get go.
- String context, which we wouldn't need anymore, with the small price of strings needing translations for each module but the big advantage of all strings having context (per module) as default.

Still we need to figure out how this would play out with our current translation workflow. I think something like this could work well:
- All strings will be imported into the db on module install. They can be updated by update scripts.
- Then, since all strings are in the db, all the strings are overridable/translatable in a consistent way.
- We no longer have differences between 'hardcoded' strings and 'user defined' strings. The only one is that hardcoded strings are originally included with the module code.

Other Pros:
- Code size would go down a lot in some cases. No more code patches to fix strings.
- Improved consistency in string handling. Better usability for translators/regular users. (Right now they need to figure out why some strings are translatable and others aren't, or why they need string overrides at all for some strings)

Cons:
- Let our handy t() function die. ?
- Worse code readability (but shorter code).

string context, per-module strings

Posted by gábor hojtsy on May 30, 2011 at 7:59am

This sounds good at first, but I think needs some revisions still. Two counter-points:

- Tying each string to its module will kill very practical sharing use cases, like for the string "Operations": http://localize.drupal.org/translate/source-details/398 (see, hundreds of module releases share this translation, which is very useful for translators so they don't have to do it all over again). Sometimes it is useful to have separate module translations, but I think context is better for that, instead of siloing strings per module.
- On context, the module name is definitely not a replacement for context as we need it (unlike your suggestion). Think of how views.module could use 'view' as the verb and 'view' as the noun. Context is really not limited to the module level, and it should be related to concepts, not module names, to be useful IMHO.

Ok about context, agree it

Posted by jose reyero on May 30, 2011 at 10:12am

Ok about context, agree it can still be useful.

But the ability to define strings per module can also help. About redundancy, what we need here is a good fall-back approach and also 'shared module names' (i.e. views_ui using the same 'module' for its strings as views, and possibly all the other views_xxx modules too). But there can be other 'views' that are names and may need different translations.

Then, once we have properly named strings, the following is also possible, nesting strings so the same string is always named/translated consistently across all the code:

'Administer your [views.view.name]' (This for Drupal 9 maybe :-) )

We could also build on the context idea and add optional 'module' and 'string name' parameters there. This could ease the transition and we could get started with longer strings (help texts, etc) which is where the benefits are easier to see.

Btw, I had this (not really the same, but some start) patch posted some time ago, http://drupal.org/node/365934

Summarized our discussion

Posted by gábor hojtsy on June 8, 2011 at 2:00pm

Summarized our discussion and started a thread at http://groups.drupal.org/node/154394 about this. I think it is a good place to get some more feedback on this idea. We really need this better defined, and we need people who want to implement it and can execute on it, if it is to happen anytime soon.

I would definitely like to be

Posted by andy inman on September 29, 2012 at 5:11pm

I would definitely like to be able to use string tokens. Quick suggestion: since t() already accepts an array of arguments, extend it so that t($string_token, $args) can specify that the string represented by $string_token should be looked up for the current language. (I do realise that you can do this already by messing with language settings and string translations.)

To address string context, per-module strings ...

When current language is Spanish and we need to lookup the string identified by '_OPERATIONS':

Lookup '_OPERATIONS' in Spanish -> Use the result if available, otherwise:
- Lookup '_OPERATIONS' in English -> 'operations' (ALL tokens are defined for English)
- Lookup the English string 'operations' in the Spanish translations -> 'operaciones' - use this translation.
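The lookup chain above might be sketched like this. It is a standalone illustration only: demo_lookup_token() and both tables are hypothetical, and nothing here is an existing t() feature.

```php
<?php
// Hypothetical token lookup with fallback through the English definition.
// $tokens:       langcode => token => text
// $translations: langcode => English source string => translation
function demo_lookup_token($token, $langcode, array $tokens, array $translations) {
  // 1. A token-specific translation in the current language wins.
  if (isset($tokens[$langcode][$token])) {
    return $tokens[$langcode][$token];
  }
  // 2. Otherwise resolve the token to its English text.
  if (!isset($tokens['en'][$token])) {
    return $token;
  }
  $english = $tokens['en'][$token];
  // 3. Reuse the shared translation of that English string, if any.
  return isset($translations[$langcode][$english]) ? $translations[$langcode][$english] : $english;
}

$tokens = array('en' => array('_OPERATIONS' => 'operations'));
$translations = array('es' => array('operations' => 'operaciones'));

print demo_lookup_token('_OPERATIONS', 'es', $tokens, $translations) . "\n";
```

This keeps the shared translation of a common word like "operations" while still allowing a per-token override where one is needed.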


more on string tokens

Posted by drifter on May 30, 2011 at 10:41am

Just thinking more about string tokens. Further cons:

- It adds a layer of abstraction, and for many people, might make writing modules harder. Having to think up string tokens, then export them, open another file and edit it makes things more tedious, encouraging people to just skip using t() altogether.
- Having the actual strings makes debugging easier: if I'm having trouble with a module, I usually grep for some string I found in the UI. If they were separate, I'd need to search for it in some translation file first, and then search for the corresponding string token, again making things more difficult.

On the plus side:

- Minor edits to a string wouldn't break all the translations. If there's a major change in the meaning of the string, the token could be changed too, signaling that translations need to be updated.

About modules as contexts:

I think contexts should be linguistic, not technical. This can be painful in i18nstrings for example, where I'd need to translate the term "News" many times, once as a content type, as a taxonomy term, as a menu item, as a block title etc. etc., where they all mean the same thing. There are lots of common interface strings like "Save", "Cancel" etc. I can see per-module context being useful sometimes, but there should at least be a fallback mechanism: if no translation exists in the current context, but one does exist in another, it should be used.
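A small standalone sketch of such a fallback, with a made-up storage layout and a hypothetical demo_translate_in_context() function (this is not how i18nstrings or core locale store things):

```php
<?php
// Hypothetical context-aware lookup with cross-context fallback.
// $translations maps source string => context => langcode => translation.
function demo_translate_in_context(array $translations, $source, $context, $langcode) {
  // Prefer a translation made for exactly this context.
  if (isset($translations[$source][$context][$langcode])) {
    return $translations[$source][$context][$langcode];
  }
  // Otherwise reuse a translation of the same string from any other context.
  if (isset($translations[$source])) {
    foreach ($translations[$source] as $by_language) {
      if (isset($by_language[$langcode])) {
        return $by_language[$langcode];
      }
    }
  }
  // Nothing found anywhere: show the source string.
  return $source;
}

$translations = array(
  'News' => array(
    'content type' => array('hu' => 'Hírek'),
  ),
);

// "News" as a menu item has no translation of its own yet.
print demo_translate_in_context($translations, 'News', 'menu item', 'hu') . "\n";
```

When several other contexts have a translation, this sketch simply takes the first one it finds; a real implementation would need some rule for choosing between them.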

(And unrelated, module functions should really know about the module they're in. Writing t('modulename', 'somestring') inside of a module just feels wrong, that t() should know somehow where it's being called from... though the only thing I can think of are global variables...)