A modern t()


Posted by sun on July 8, 2011 at 1:30pm
Last updated by sun on Sat, 2011-07-09 00:45

Use cases:

Translate and format this string. (t())
Format this string, but don't translate. (strtr() + prepared arguments)
Mark this string as translatable, but don't translate it. (hook_menu(), hook_menu_alter(), watchdog(), ...)
Translate and format this string during installation. (st())
Format a plural form of this string. (format_plural())
Translate and format this string in a specific context. (t('View', array(), array('context' => 'noun')))
Translate and format this string in a specific language. (t('View', array(), array('langcode' => 'hu')))

Problem:

We basically want one function that accepts two arguments, but we actually need the two argument values separately, either in the calling code (e.g., hook_menu()) or for storage and delayed processing later on (e.g., watchdog()).
We want the flexibility of using format_plural() also in locations where it's currently not possible (e.g., watchdog()).
The entire thing still needs to be performant/fast.
Still has to be simple (DX).

Idea:

Turn t() into a singleton, returning a chainable object.
Merge different features onto the object as methods.
Use, pass on, and possibly store the t() object, instead of separate 'title' and 'arguments' values; retaining state and all properties.

t('I was here.')->render();

t('%user was here.')->arg('%user', $user)->render();

t('%user was here.')->arg('%user', $user)->formatOnly();

t('%user was here.')->translatable();

t('View')->context('noun')->render();

t('@count item was updated.')->count($amount)->plural('@count items were updated.')->render();
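
To make the chained calls above a bit more concrete, here is a rough standalone sketch of what such an object could look like. Everything in it is an assumption for illustration: the class name TranslatableSketch, the factory name t_sketch() (named that way so it does not collide with the real t()), the placeholder translate() method that just returns the source string, and the English-only singular/plural rule. It also escapes every argument the same way, whereas the real t() treats @, % and ! placeholders differently.

```php
<?php
// Hypothetical sketch; none of these names exist in Drupal core.
class TranslatableSketch {
  protected $string;
  protected $args = array();
  protected $context = '';
  protected $count = NULL;
  protected $plural = NULL;

  public function __construct($string) {
    $this->string = $string;
  }

  // Each setter returns $this so that calls can be chained.
  public function arg($token, $value) {
    $this->args[$token] = $value;
    return $this;
  }

  public function context($context) {
    $this->context = $context;
    return $this;
  }

  public function count($count) {
    $this->count = $count;
    $this->args['@count'] = $count;
    return $this;
  }

  public function plural($plural) {
    $this->plural = $plural;
    return $this;
  }

  // Format without translating.
  public function formatOnly() {
    return $this->format($this->pick());
  }

  // Translate, then format.
  public function render() {
    return $this->format($this->translate($this->pick()));
  }

  // Placeholder for the real lookup: returns the source string unchanged.
  protected function translate($string) {
    return $string;
  }

  // Assumed English-only rule: singular for exactly 1, plural otherwise.
  protected function pick() {
    if ($this->plural !== NULL && $this->count !== NULL && $this->count != 1) {
      return $this->plural;
    }
    return $this->string;
  }

  // Simplification: every argument is HTML-escaped, whatever its prefix.
  protected function format($string) {
    $replacements = array();
    foreach ($this->args as $token => $value) {
      $replacements[$token] = htmlspecialchars((string) $value, ENT_QUOTES, 'UTF-8');
    }
    return strtr($string, $replacements);
  }
}

function t_sketch($string) {
  return new TranslatableSketch($string);
}

print t_sketch('@count item was updated.')->count(3)->plural('@count items were updated.')->render() . "\n";
```

Note that the object keeps the source string, arguments, context and count separately until render() or formatOnly() is called, which is the property the hook_menu() and watchdog() use cases need.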

d.o issues:

Comments

interesting

Posted by gábor hojtsy on July 8, 2011 at 2:56pm

Interesting. I'm wondering if arg() in your plan would optionally accept an array for multiple arguments. Also, render() does sound like a lot more to type...

Anyway, the key thing for extractability, as always, is to have clear marks of your strings and their context (as well as their plurals). This looks like it has the needed token wrapping for that, so no counter-arguments there. I think it would be great to get wide feedback on this before doing anything with it, but I can see it being useful.

Also not sure how this would map to the menu and watchdog APIs, you have no examples of that yet.

Simpler version?

Posted by damien tournoud on July 8, 2011 at 3:37pm

Simpler, let's just make the object returned by t() implement __toString() for translation and __sleep() for serialization. The menu system can just store that serialized. That said, I'm unsure about mixing translation and formatting. Those feel like two widely different use cases.
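
As a rough illustration of that idea (not the prototype code): a minimal standalone sketch in which the object is rendered via __toString() and stored via __sleep(). The class name LazyStringSketch and the $rendered cache property are made up for this example, and the "translation" step is just strtr() on the source string.

```php
<?php
// Hypothetical sketch; LazyStringSketch is not an existing class.
class LazyStringSketch {
  protected $string;
  protected $args;
  // Cached output; deliberately left out of the serialized form.
  protected $rendered = NULL;

  public function __construct($string, array $args = array()) {
    $this->string = $string;
    $this->args = $args;
  }

  // Called whenever the object is used as a string.
  public function __toString() {
    if ($this->rendered === NULL) {
      // A real implementation would look up the translation here.
      $this->rendered = strtr($this->string, $this->args);
    }
    return $this->rendered;
  }

  // Only the source string and arguments are stored, so the result is
  // re-evaluated in whatever request unserializes the object.
  public function __sleep() {
    return array('string', 'args');
  }
}

$title = new LazyStringSketch('Hello @name', array('@name' => 'sun'));
$first = (string) $title;          // Renders and fills the cache.
$stored = serialize($title);       // What a menu router table could store.
$restored = unserialize($stored);  // The cache is empty again here.
print $restored . "\n";
```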

Damien Tournoud

Turn t() into a singleton,

Posted by donquixote on July 8, 2011 at 3:44pm

Turn t() into a singleton, returning a chainable object.

I don't see why it has to be a singleton.
It can very well be a factory / retrieval function with a static cache, plus one or more classes that will be instantiated when you call the factory / retrieval function.
It does not even look like a singleton in your example code.. (probably meant to be under the hood)

Also, render() does sound like a lot more to type...

__toString() is your friend.

Merge different features onto the object as methods.

ok with this

Use, pass on, and possibly store the t() object

I like this.
There are a few aspects that deserve further thinking.
- Store these objects in a rendering array for lazy evaluation, instead of strings.
- Nested t() objects: Arguments (tokens) can themselves be t() objects.
- Shared token arrays. For instance, a bunch of strings with user-related tokens (username, etc), all to be evaluated with the same array of tokens.

->count($amount)->plural('@count items were updated.')

hmm
What if we have more than one token that can be plural?
(I guess this can be solved with nesting in most cases)

plurals

Posted by gábor hojtsy on July 8, 2011 at 3:57pm

We do not currently support more than one token that can be plural either. I don't know how it could be solved. You'd need to present a version of the text for all combinations. Also, plural rules would most probably not be useful for selecting the right one. Consider that some languages have 4 different variants for plurals, so if you have two variables that can be plural, they'd need to provide 16 translations for that single string :) Not sure we want to go there.

What about we do this on the

Posted by donquixote on July 8, 2011 at 4:10pm

What if we did this at the translation interface level?
So, the developer would only write
t('@count items were updated.')->arg('@count', $amount);

And then in the translation interface someone would manually add a singular version for the '@count' argument in English.
In other languages, there could be more versions added for the various plural forms.
This stuff would then go into the po file. And yes, this would mean a po file for English then..

"no translation interface"

Posted by gábor hojtsy on July 8, 2011 at 4:19pm

No, there is no translation interface to be involved here. Drupal distributions and modules should be self-sufficient packages that include all their translatable text without a requirement for outside injection. For how the exact variant is selected, we still need a special-cased argument. We can use the same argument syntax for it, but we always need to know which single one it is, so we can apply our plural selection algorithm to it.

To clarify, the idea would be

Posted by donquixote on July 9, 2011 at 4:04pm

To clarify, the idea would be to distribute the po file with the module. This means, a contrib developer would have to take care of the po file. Maybe not what we want.

Still, maybe we can improve on the syntax.

<?php
$t_updated = t('@count items were updated.')->singular('@count', 'one item was updated.');
$t_updated->arg('@count', 1)->render();  // renders the singular form
$t_updated->arg('@count', 4)->render();  // renders the plural form

$t_birds = t('@n birds are singing')->ifArgValue('@n', array(
  0 => 'No birds around it seems',
  1 => 'A bird is singing',  // use 'A' instead of '1'
  2 => 'Two birds are singing',  // use explicit number 'Two' instead of '2'
));
$t_birds->arg('@n', 0)->render();
$t_birds->arg('@n', 5)->render();
// etc
?>

What we gain:
- The same prepared thing can be fired with different arguments.
- Plural is detected directly from the argument.
- If the argument value is manipulated, the plural/singular will adapt automatically.

I think it is a good idea to make plural the default thing, and then define exceptions for zero, one or other specific numbers.

language plural rules

Posted by gábor hojtsy on July 9, 2011 at 4:14pm

Here is a wiki page with some plural rules for languages: http://translate.sourceforge.net/wiki/l10n/pluralforms For example, Ukrainian has these rules: nplurals=3; plural=(n%10==1 && n%100!=11 ? 0 : n%10>=2 && n%10<=4 && (n%100<10 || n%100>=20) ? 1 : 2). nplurals just says how many variants there are, and the plural expression resolves to the right variant number based on the replacement value for n. Pretty simple, right? Now, how would this be reconciled with your added feature of being able to specify English original variants for specific values? You propose we could set up a different set of rules for English than currently assumed, but then we'll have a different set of source strings to work with for plurals (while .po assumes and only allows two). How would you then map those to translations which clearly have other (and per-language predefined) rules? How would you mix the rules set up in code from your example with these language-specific ones?

Ok, let's first assume we

Posted by donquixote on July 9, 2011 at 5:12pm

Ok, let's first assume we only have the English plural hardcoded, and nothing else.
Then let's take the Ukrainian as an example.

nplurals=3;
plural
  = (n%10==1 && n%100!=11)
  ? 0
  : (n%10>=2 && n%10<=4 && (n%100<10 || n%100>=20))
  ? 1
  : 2
;
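
For reference, the same rule written out as a PHP function. This is only a sketch; the function name plural_index_uk() is made up. One detail worth knowing: the gettext formula relies on C's right-associative ternary operator, so it cannot be pasted into PHP unchanged (PHP's ternary was left-associative, and PHP 8 rejects unparenthesized nesting). Plain if statements avoid the issue.

```php
<?php
// Hypothetical helper: returns the Ukrainian plural form index (0, 1 or 2)
// for a non-negative integer $n, following the gettext rule quoted above.
function plural_index_uk($n) {
  if ($n % 10 == 1 && $n % 100 != 11) {
    return 0;
  }
  if ($n % 10 >= 2 && $n % 10 <= 4 && ($n % 100 < 10 || $n % 100 >= 20)) {
    return 1;
  }
  return 2;
}

foreach (array(1, 2, 5, 11, 21, 22, 112) as $n) {
  print $n . ' => form ' . plural_index_uk($n) . "\n";
}
```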

The translation form for a Ukrainian translator would have a text field for the normal plural form (that's form 2 I guess), and then a button for each could-be-numeric argument.

For instance, on '@n birds are singing', the '@n' is potentially a number.
(maybe we could introduce a placeholder syntax that more easily identified numbers)

You click the button "special plural / singular forms depending on the '@n' argument".
This reveals more text fields, one for each plural or singular form that is known in Ukrainian.
(The text fields could be revealed by default, if we know for sure that '@n' must be an integer and can be plural or not.)

You type in the text variations and save.

The system then needs a sane structure to store this stuff.

Theoretically, we need to remember for each of these additional strings, what plural form they are for and what argument they depend on.
A first-guess approach would be a translation table with columns language, original, arg_token, plural_form, translated_string.

We'd still have a problem with translation strings that contain more than one numeric could-be-plural argument.
No matter what we do, there is no easy solution to these. Either we make the form even more complex, allowing for a multi-dimensional override, or we attempt some kind of string merge. Or we restrict translatable strings to only ever have one could-be-plural argument.
This is all unpleasant, but that's the irreducible complexity of this edge case.
I guess the best thing would be a string merge + optional override, and at the same time try to avoid this edge case as much as possible.

EDIT.
The logic for choosing one of the plural forms depending on the argument would be defined once per language, via some kind of plugins / handlers.

EDIT II.
The plural_form could either hold a string key identifying that plural form, or a number for an exact value match, or NULL for the default plural. The number would allow specifying additional overrides for things like 0, 2, 3 in European languages.

EDIT III.
Having this more complex storage, I am not sure how this can fit in po files. Can it?

I guess the best thing would

Posted by donquixote on July 9, 2011 at 5:26pm

I guess the best thing would be a string merge + optional override, and at the same time try to avoid this edge case as much as possible.

The storage for this multi-conditional override would be quite complex I guess..
We would have to build an array of $arg_token => $plural_form that expresses the override condition, then normalize that array (sort by key), then serialize and store it in a "condition" table column.
This would be a separate table then, for the multi-dimensional overrides.
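
A minimal sketch of that normalization step, assuming the condition is a plain array of $arg_token => $plural_form. The function name plural_condition_key() and the form names 'few' and 'one' are invented for the example.

```php
<?php
// Hypothetical helper: turns an override condition into a stable string
// that could be stored in (and looked up from) a "condition" column.
function plural_condition_key(array $condition) {
  // Sort by token so that argument order does not change the key.
  ksort($condition);
  return serialize($condition);
}

$a = plural_condition_key(array('@birds' => 'few', '@cats' => 'one'));
$b = plural_condition_key(array('@cats' => 'one', '@birds' => 'few'));

// Both orderings describe the same condition and produce the same key.
var_dump($a === $b);
```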

And it should be mentioned, this is a problem that is not yet solved, and that can't have an easy solution.

po files, current features

Posted by gábor hojtsy on July 10, 2011 at 8:03am

Yes, we do not support multiple can-be-plural values now. And I don't think this structure you propose can fit into a .po file. I've not seen a strong need to support this and it seems the required work would way outweigh the gains.

__toString()

Posted by gábor hojtsy on July 8, 2011 at 3:53pm

On relying on __toString(): that would not work in most places where t() is used right now. When you use $list[] = t('view %node')...; the list will hold the object. When you set Form API labels, titles, or descriptions, the object will end up there too, right? Unless you do $list[] = (string) t('view %node')...; which is almost as much to type as render() :)
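
A small standalone sketch of the behavior in question, using a made-up StringLikeSketch class in place of whatever t() would return. Assigning into an array keeps the object; only an explicit cast or a string context (such as implode() or print) triggers __toString().

```php
<?php
// Hypothetical stand-in for the object a modern t() might return.
class StringLikeSketch {
  protected $string;

  public function __construct($string) {
    $this->string = $string;
  }

  public function __toString() {
    return $this->string;
  }
}

$list = array();
$list[] = new StringLikeSketch('view %node');           // Stays an object.
$list[] = (string) new StringLikeSketch('edit %node');  // Cast right away.

var_dump(is_string($list[0]));
var_dump(is_string($list[1]));

// String contexts convert the object implicitly.
print implode(', ', $list) . "\n";
```

So any code that checks is_string() or uses the value as an array key would see a difference, while plain output would not.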

We should be happy about

Posted by donquixote on July 8, 2011 at 4:01pm

We should be happy about having the object instead of a string.
Having this in the render array allows further manipulation, whereas on a string we can only do some desperate regular expressions.

Interesting

Posted by Crell on July 8, 2011 at 3:58pm

I'm generally in favor of converting complex use case functions into multi-method objects. However, in this case I wonder what the performance trade-off will be. Right now, t() is in the degenerate case (one language site with no replacements) a single function call. With the approach above, it would be a minimum of three (t(), the object constructor, and the render() call). t() is one of the most-called functions in Drupal, so any performance impact there is going to be magnified considerably. Tripling the number of stack calls along the critical path makes me very worried.

True. On the other hand, we

Posted by donquixote on July 8, 2011 at 4:03pm

True.
On the other hand, we might gain performance with the lazy evaluation, if we can skip database requests for strings that are not rendered, or that are discarded.
Or, we could even attempt a multi-lookup, if we have a bunch of t() objects in a render array to be evaluated at once.
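
A very rough sketch of what such a multi-lookup could look like, under several assumptions: the class PendingStringSketch, the functions lookup_many_sketch() and translate_all_sketch(), and the in-memory array standing in for a single database query are all invented for illustration.

```php
<?php
// Hypothetical string object that has not been translated yet.
class PendingStringSketch {
  public $source;
  public $translation = NULL;

  public function __construct($source) {
    $this->source = $source;
  }

  // Falls back to the source string if no translation was found.
  public function __toString() {
    return $this->translation !== NULL ? $this->translation : $this->source;
  }
}

// Stand-in for one "SELECT ... WHERE source IN (...)" style query.
function lookup_many_sketch(array $sources) {
  $table = array('Save' => 'Speichern', 'Cancel' => 'Abbrechen');
  return array_intersect_key($table, array_flip($sources));
}

// Resolves all pending strings with a single lookup call.
function translate_all_sketch(array $strings) {
  $sources = array();
  foreach ($strings as $string) {
    $sources[] = $string->source;
  }
  $found = lookup_many_sketch(array_unique($sources));
  foreach ($strings as $string) {
    if (isset($found[$string->source])) {
      $string->translation = $found[$string->source];
    }
  }
}

$strings = array(
  new PendingStringSketch('Save'),
  new PendingStringSketch('Cancel'),
  new PendingStringSketch('Preview'),
);
translate_all_sketch($strings);
print implode(' | ', $strings) . "\n";
```

Whether one batched query actually beats the existing per-string caching would need profiling.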

It can also help to delay the

Posted by moshe weitzman on July 8, 2011 at 4:13pm

It can also help to delay the t() evaluation such that the whole page context is worked out. Further, our language negotiation could be delayed a bit which might be helpful.

No more delay

Posted by Crell on July 10, 2011 at 4:18am

I am extremely wary about even further delayed behavior. We already delay too much. Every time we delay an operation, we're carrying around extra memory-expensive metadata. We're also making the API that much more complicated.

Also, one of the goals of the WSCCI initiative is to be able to partially-render pages, that is, blocks, in complete isolation. That is mandatory if we want to do proper ESI, or use blocks for AHAH, or implement a Drupal Big Pipe system, or various other things that we want to do. That is, "such that the whole page context is worked out" goes away; the page does not exist as a single context, period. So if WSCCI is successful, delaying rendering a string until we "render the whole page" becomes a logical impossibility.

current delayed execution required

Posted by gábor hojtsy on July 10, 2011 at 9:26am

Currently, hook_menu() and watchdog() require delayed execution of localization because they define the localizable strings not when they are actually displayed. Possibly WSCCI will solve the hook_menu() problem (title and description for menu items). I think watchdog() still looks like a use case.

Terminology clash

Posted by sun on July 10, 2011 at 2:35pm

delayed execution != delayed execution...

hook_menu() (as well as other registries) and watchdog() are typical examples for code that wants to store a state of t(), in order to trigger its actual execution in a completely different request and environment.

The string itself, context, and arguments are retained, but e.g. the target translation language is re-evaluated for the new request environment.
Delayed (page) rendering execution, as partially discussed above, targets the delayed execution within the same server request, and actually happens automatically via __toString() already -- the initial call to t() initializes a new object only; translation, sanitization, and formatting happens when the "string object" is eventually concatenated into the page output at some point.

Or it might not happen at all. I wonder how many strings are generated but never used.

Daniel F. Kudwien
netzstrategen

Early render of blocks "in

Posted by moshe weitzman on July 10, 2011 at 6:46pm

Early render of blocks "in complete isolation" does have lots of advantages. One outstanding question in my mind is the role of the theme system. One benefit of render() in the templates is that you only ever render stuff that the themer actually wants on the page. This saves CPU and improves performance (the delay does increase memory usage, as you noted). I guess you are thinking that we do away with delayed render, and force themers to work at an earlier stage like hook_block_list_alter() and hide fields using Manage Fields instead of in preprocess and templates.

Possibly

Posted by Crell on July 12, 2011 at 12:09am

That's one approach. At the moment, I can see blocks returning a renderable array but that array getting rendered immediately rather than waiting for all possible blocks on the page to be built. (EDIT: And by implication, we could opt to have a per-block alter hook invocation rather than a single global hook_page_alter() if we chose to do so.) If we don't keep as large a render object at any given time, that does reduce the peak memory usage (and the CPU usage is, probably, a wash). Whether or not we want to spend some of that memory savings on delaying even more things is, I suppose, a question open for debate.

That said, I am still wary about the DX implications of moving even more into render arrays, however.

Even with render arrays, if

Posted by catch on July 12, 2011 at 1:11pm

Even with render arrays, if you use drupal_render() caching, then the actual stuff that gets rendered isn't passed around in memory - you build a small array with enough metadata to build the content (i.e. the context) + a pre_render callback, then when it comes to render time, the string is either pulled from cache, or the pre_render is executed and goes into a string. There are a few extra bits like #attached, #cache in there as well of course.

The main reason we have these huge arrays in core at the moment is because we don't make any attempt to cache HTML (except in the page cache, and about one block) - whichever way we do that, it is likely to mean starting with a block structure that has the minimum information to build the block (context + callback), then either a cache hit or a cache miss at some point down the line.

Moving everything to the pre_render + cache pattern would mean you end up at the end of the request with a skeleton page array and a tonne of cache_get() / pre_render callbacks to execute.

Blocks v. 4 is likely to mean you start close to the beginning of the request with a skeleton page layout and ... well very close to the same thing just from opposite directions.

Fundamentally we want an absolute minimum amount of metadata in memory - enough to pull the block from cache or handle a cache miss. If either model got fully implemented, then all the t() calls would likely only be executed on those cache misses, so not contribute to memory usage that much. The big shift is going to be in generating HTML that can actually be cached - or at least moving to a model that assumes that it will be.

It doesn't really matter if

Posted by catch on July 11, 2011 at 5:49am

It doesn't really matter if it's a full page or a block - if string processing only happens when the string is actually rendered then you are able to avoid the processing overhead for strings that never get rendered - this works at any level of page building.

If the amount of stuff getting rendered by any request is smaller, then the memory overhead is going to be proportionally less too.

What the memory impact actually might be from this will need profiling, it might be unacceptable, it might be negligible.

On ESI, whatever we're able to do in relation to ESI, it is very likely that core will still be building whole pages in one go a lot of the time. Poor man's ESI with regexps is still one page request. Having varnish make http requests for blocks is a lot of extra http requests especially on cold starts. Anything that relies on daemons is going to be contrib - and that's an area which is still quite young for PHP applications.

So there are definitely CPU/memory/API trade-offs in all of this, but I don't think it is at all clear what those are yet.

Hybrid?

Posted by dmitrig01 on July 9, 2011 at 12:47am

What about doing some sort of hybrid between how it is now and the proposed solution: use procedural 99% of the time, but if needed (for delayed evaluation and such), use OO. Of course, this would make the API harder, but I'd bet it's worth it.

Like db_query() vs. db_select()?

Posted by Crell on July 10, 2011 at 4:13am

It sounds like you're suggesting something like an optional "translation query builder". Like, t() doesn't change, but we add a translate($string) factory that returns an object that internally calls back to t(). I can see that being useful, although I'm not sure what all of the use cases are as I'm no linguist.

I've got a related question, actually. In Drupal 7, we actually have two very-similar systems: t() and tokens. Both do string-pattern-completion, using a base string that is frequently human-readable and insertion values that sometimes have security implications so need filtering in some cases.

Perhaps those should not be quite so separate systems? (I've no idea if that makes sense; it just occurs to me that there's overlap.)

Yes, exactly

Posted by dmitrig01 on July 10, 2011 at 8:11pm

This approach would allow for the extra flexibility with only minor performance implications

You're on to something. The

Posted by dave reid on July 12, 2011 at 10:57pm

You're on to something. The reason they're currently separate is that token_scan() is a bit more expensive, otherwise we could have probably built it into t(). Also, the token system doesn't have to be used for strings - like for Pathauto, etc. But I do see the point. It would be nice if we could just pass token data into t() and have it do the token replacement for us. Having token_replace(t('My [special] string.')) is kinda silly.

Senior Drupal Developer for Lullabot | www.davereid.net | @davereid

Code

Posted by sun on July 9, 2011 at 12:48am

Started to work on a prototype, and can't really believe that Drupal even bootstraps and mostly works as usual:

A modern t()

Daniel F. Kudwien
netzstrategen

Performance

Posted by sun on July 10, 2011 at 2:06pm

Although testbot claims that Drupal installation fails, the prototype code is already in a functional state and can be applied to an existing site, resembling current functionality.

Someone (dude, where's the thread in comment previews?) raised possible performance issues, and I guess we all agree that would be a total show-stopper. Anyone willing to do some benchmarks/profiling?

Daniel F. Kudwien
netzstrategen

A t() call to rule them all

Posted by plach on July 9, 2011 at 9:39am

I have been playing with the following idea for a while, I ain't sure it's really feasible (mainly for the performance concerns above) or really interesting, but here it is (optional parameters left out for simplicity):

<?php

interface Translatable {
  function translate();
}

class StringObject implements Translatable {
  private $string;

  function __construct($string) {
    $this->string = $this->init($string);
  }

  function init($string) {
    // look up possible alternative version depending on context
    // for instance idiomatic language overrides would go here
    return $string;
  }

  function translate() {
    return $this->string; // placeholder for what t() does now
  }
}

function t($data) {
  if (is_string($data)) {
    $data = new StringObject($data);
  }
  if ($data instanceof Translatable) {
    return $data->translate();
  }
  return $data;
}

?>

I ain't sure how it plays with what sun is proposing, but this definitely looked the right place to post it ;)

This may not be exactly the

Posted by owen barton on July 10, 2011 at 5:15am

This may not be exactly the same problem being discussed here, but in terms of caching the current t() is still quite far from optimal. Considering more "adaptive" caching strategies, it would be good to consider ideas along the lines of http://drupal.org/node/152901 - basically managing content addressable caches with cache ids managed on a per-path and/or per-user basis. The aim is for this to provide the opportunity to cache all (or almost all) of the strings on a page, avoiding lots of DB queries, whilst at the same time keeping the cache targeted and small enough to be memory efficient.

This looks really promising.

Posted by jose reyero on July 10, 2011 at 7:16pm

This looks really promising. Some thoughts:

We should aim at something more powerful than the current t(), not at just rewriting t() as an object wrapper. On that note, I'd like to see how we can better handle bigger multi-line texts. Let's try to imagine something like:

t()->p('First text line')->p('Second paragraph')

We could think of the base object not as a translation, but as a string. It is when rendering that string that we should think about translating it or not. I mean something like s() for 'string' may be better than t().
Could we use it to handle user defined strings too? Could it take some string key/name as a parameter somehow?

Zend Translate

Posted by robloach on July 11, 2011 at 5:42am

There is probably some stuff we could learn/take from Zend Translate.

<?php
$translate = new Zend_Translate(
    array(
        'adapter' => 'gettext',
        'content' => '/my/path/source-de.mo',
        'locale'  => 'de'
    )
);
$translate->addTranslation(
    array(
        'content' => '/path/to/translation/fr-source.mo',
        'locale'  => 'fr'
    )
);
 
print $translate->_("Example") . "\n";
print "=======\n";
print $translate->_("Here is line one") . "\n";
print "\n";
 
$translate->setLocale('fr');
print $translate->_("Here is line two") . "\n";
$translate->plural('Car', 'Cars', $number);
?>

.... etc.

What about an unsuccessful translation

Posted by hanno on July 13, 2011 at 10:57pm

Maybe another improvement when modernising t(): it would be interesting if information were returned about whether the string was translated or is still in English:
See http://drupal.org/node/1165476 (language of parts) and http://drupal.org/node/218021 (highlighting untranslated text)
