Improving Japanese transliteration in Drupal 8

Posted by Garrett Albright on November 21, 2013 at 7:17am

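Summary (translated from Japanese): Drupal 8's transliteration system is weak when it comes to Japanese. Kanji are transliterated using Pinyin (the Chinese pronunciation), so 「新聞記事」, for example, becomes "xinwenjishi"! I'd like to fix this with a module, but which approach is best? It's a challenge. The rest is in English, sorry.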

Drupal 8 introduces a transliteration class into core. This class is used when text in pure ASCII is desirable: for example, for the system names of new content types or the filenames of uploaded files.

The problem is that it's very simplistic. It can pretty much only map one character in the source language to one or more ASCII characters, and each character has no context about the characters around it. Thus, it doesn't work well for a language where 新 can be "atara" or "ara" or "shin" depending on the surrounding characters. In fact, the core system as it stands doesn't even try to use Japanese pronunciation for Han characters; instead, it just uses the Pinyin (Chinese) transliteration, even if you tell Drupal to use Japanese. If you try to create a content type titled 新聞記事, Drupal will give it a system name of "xinwenjishi." (It's not great with kana either; "ちょっと待って" becomes "chi_yo_tsutodai_tsute." Wow.) You can test this yourself by running print Drupal::service("transliteration")->transliterate("日本語", "ja"); from Drush or by some other method (you'll get "ribenyu").
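
To make the kana problem concrete, here is a minimal sketch in Python (not Drupal's actual code) contrasting a context-free per-character table with a lookup that peeks at neighbouring kana. The tables are tiny and hand-written for this one phrase, and the function names are made up for illustration.

```python
# Per-character table: each kana maps to one fixed ASCII string, with no context.
CHAR_TABLE = {"ち": "chi", "ょ": "yo", "っ": "tsu", "と": "to", "て": "te"}

# Two-kana units that a context-free table cannot express.
DIGRAPHS = {"ちょ": "cho"}

SOKUON = "っ"  # small tsu: doubles the consonant of the following kana
CONSONANTS = "bcdfghjklmnpqrstvwxyz"


def naive(text):
    """Map every character on its own, the way a simple lookup table would."""
    return "".join(CHAR_TABLE.get(ch, ch) for ch in text)


def context_aware(text):
    """Handle digraphs and the sokuon by looking at neighbouring kana."""
    out = []
    i = 0
    double_next = False
    while i < len(text):
        if text[i] == SOKUON:
            double_next = True
            i += 1
            continue
        pair = text[i:i + 2]
        if pair in DIGRAPHS:
            roman = DIGRAPHS[pair]
            i += 2
        else:
            roman = CHAR_TABLE.get(text[i], text[i])
            i += 1
        if double_next:
            # Only double when the next unit starts with an ASCII consonant.
            if roman[0] in CONSONANTS:
                roman = roman[0] + roman
            double_next = False
        out.append(roman)
    return "".join(out)
```

With these tables, naive("ちょっと") gives "chiyotsuto" while context_aware("ちょっと") gives "chotto". Kanji are passed through untouched here; they still need a glosser that knows readings.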

I've been pondering ways to improve this with a contributed module. My first thought was to implement a glosser in PHP using a freely available dictionary such as JMDict. But this was going to be a lot of work and imprecise, and my initial tests were quite slow.

I contacted Jim Breen, one of the original creators of the JMDict project and probably one of the most experienced people in the field of Japanese translation and transliteration, to ask if I could have access to one of the specialized dictionaries he uses to gloss Japanese quite well on his own site, and if he had any general advice for this sort of job. He pointed me towards a project called MeCab, a glossing library written in C++, which you can play with here. (It seems to be pretty solid, except if you enter 日本人 or アメリカ人 or any other country-人; it seems to always transliterate the 人 as ニン.) Of course, the problem with this is that we can't expect site admins to have the know-how or ability to compile or install native libraries on their systems. So we'd either have to rewrite it in PHP, or…

Or I could set up a server which uses MeCab to do nothing but serve transliterations of Japanese text in response to REST queries. The contributed module could just call that server in the back end and return the results. We could set it up on a VPS in Japan so that the audience most likely to use it gets the fastest responses. Maybe a couple of them, so they could fall back or round-robin. Other OSS projects with similar needs could use it too. The source code to set up this server would also be freely available, so organizations that are leery about sending things out to external servers could just set up a server inside their firewall, or maybe even on the same machine as the web server, with the two talking over a Unix socket connection. Though I don't think this solution is ideal (more moving parts), it might end up being faster to develop and provide higher-quality results than implementing our own glosser in PHP.
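
The fallback idea could look roughly like this on the client side. This is a sketch in Python only: the server URLs, the `text` query parameter and the `result` JSON field are all assumptions, since no such API exists yet.

```python
import json
import urllib.parse
import urllib.request


def http_fetch(base_url, text, timeout=2.0):
    """GET base_url?text=... and return the "result" field of the JSON reply.

    Both the query parameter and the reply format are hypothetical.
    """
    url = base_url + "?" + urllib.parse.urlencode({"text": text})
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)["result"]


def transliterate_remote(text, servers, fetch=http_fetch):
    """Try each server in order; return None if every one of them fails."""
    for base_url in servers:
        try:
            return fetch(base_url, text)
        except (OSError, ValueError, KeyError):
            # Network error, malformed JSON or missing field: try the next server.
            continue
    return None
```

If every server fails, the caller gets None and can fall back to whatever core transliteration produces, so a server outage degrades results rather than breaking the site.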

Any thoughts?

Comments

No comments, but I'm glad to

Posted by jaypan on November 25, 2013 at 10:08am

No comments, but I'm glad to see you're on it! Good luck, and please keep us posted on your progress.

It is very nice that we can have Transliteration function for Ja

Posted by qchan on November 28, 2013 at 4:05pm

Mr. Garrett, thank you for your useful proposal.
It would be very nice to have a transliteration function for Japanese in core.

We have discussed a transliteration module here many times in the past.
Our team (ANNAI) developed our own modules and published them as sandbox projects.
(They are not sufficiently maintained, though.)

Please check the link below.
(If there are mistakes in my English and you cannot understand something, please let me know.)

☆jp_kakasi_transliteration
https://drupal.org/sandbox/qchan/1324644

☆jp_mecab_transliteration
https://drupal.org/sandbox/qchan/1324666

MeCab is fast, and it may become usable with Apache Solr in the future, so I recommend MeCab if you can set up the operating environment.
If you can set up a VPS, it is easy to use, because Debian-like distributions and Red Hat clones have a package for it.

It is difficult to install on a typical shared hosting server, so we also prepared a module which uses the KAKASI library instead.
Compared to MeCab, this library is easier to set up and is usually already installed on typical shared hosting in Japan.

* Shared hosting servers where KAKASI is available:
Operating Experience/Hosting Server
http://pukiwiki.sourceforge.jp/?%E5%8B%95%E4%BD%9C%E5%AE%9F%E7%B8%BE%2F%...

Whether we choose MeCab or KAKASI, we have to find a way to connect it to Drupal, and we may need to make it a web service in the future, as you said.
For now, I still cannot convert these modules into Drupal 8 classes, but it would be very nice if, with your help, we could use a transliteration function that supports Japanese in Drupal core.
I'd like to keep discussing this matter.

qchan, thanks for the

Posted by Garrett Albright on December 3, 2013 at 1:06am

qchan, thanks for the information. I probably should have looked to see what others in the community had tried before starting this, including searching in Japanese… My mistake.

The code you linked to will be useful for inspiration as I go forward on this. Now, I'm thinking this project can look like the following:

A base "Japanese Transliteration" module which does improved katakana-to-romaji transliteration and can provide an interface for other modules.
A "Japanese Transliteration: MeCab" module which will use MeCab to do kanji/kana-to-katakana glossing. Those results are then passed to the base "Japanese Transliteration" module for transliteration.
Later, a "JT: MeCab server" module which can provide MeCab results via REST (eventually we may be able to do this without bootstrapping Drupal, but at first it will probably do so), and a "JT: MeCab client" module which could connect to those servers, for sites that cannot install MeCab locally.
At some point, a "JT: KAKASI" module could be created as well, or others for other glossing systems, which integrate with the base "Japanese Transliteration" module.
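
As a rough sketch of how those pieces could fit together (in Python for brevity; the real modules would be PHP, and every name here is hypothetical): a glosser turns kanji/kana into katakana, and the base layer turns katakana into romaji. The stub glosser below stands in for MeCab with two hard-coded readings.

```python
from typing import Protocol


class Glosser(Protocol):
    """Anything that can turn mixed kanji/kana text into katakana readings."""

    def to_katakana(self, text: str) -> str: ...


# Base layer: katakana-to-romaji table (only the kana needed for the demo).
KATAKANA_TO_ROMAJI = {"シ": "shi", "ン": "n", "ブ": "bu", "キ": "ki", "ジ": "ji"}


class StubGlosser:
    """Stand-in for a MeCab- or KAKASI-backed glosser; readings are hard-coded."""

    READINGS = {"新聞": "シンブン", "記事": "キジ"}

    def to_katakana(self, text: str) -> str:
        for word, reading in self.READINGS.items():
            text = text.replace(word, reading)
        return text


def transliterate(text: str, glosser: Glosser) -> str:
    """Gloss to katakana first, then romanize the katakana character by character."""
    katakana = glosser.to_katakana(text)
    return "".join(KATAKANA_TO_ROMAJI.get(ch, ch) for ch in katakana)
```

Here transliterate("新聞記事", StubGlosser()) yields "shinbunkiji". Swapping in a different glossing system only means providing another object with a to_katakana method; the base layer stays the same.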

It will still be a lot of work, though having the MeCab PHP library to work with definitely makes things easier. Fortunately, we still have a lot of time before the Drupal 8 release.

Life has been busy recently, but I'm going to try to get the code good enough to start a sandbox project of my own by the end of next weekend.

The Boise Drupal Guy!

Okay, I've pushed some

Posted by Garrett Albright on December 10, 2013 at 7:12am

Okay, I've pushed some working code into a sandbox repository. It's pretty terrible right now, but it works. It at least proves that the idea is sound.

I'll keep working at it. I'm learning a lot of new things about Drupal 8 in the process.

The Boise Drupal Guy!

Here's a progress report so

Posted by Garrett Albright on December 24, 2013 at 6:52pm

Here's a progress report so far.

As mentioned above, I've created a Japanese Transliteration sandbox project which features two modules: Japanese Transliteration, and Japanese Transliteration: MeCab. The former does kana transliteration, while the latter does the actual kanji glossing, and could easily be swapped out for one that works with KAKASI or other systems. If you're interested in building one, just copy what the MeCab module is doing and away you go.

I've also created a Remote Transliteration project which takes the client/server idea discussed above and generalizes it so it doesn't depend on Japanese Transliteration and could be used with any site doing transliteration in any language. The project contains separate Server and Client modules.

So to set up a server for Japanese transliteration, download the Japanese Transliteration project and enable the Japanese Transliteration and Japanese Transliteration: MeCab modules. Then download the Remote Transliteration project and enable the Remote Transliteration Server module. None of these three modules should need any configuration to start working. Then, to set up a client on a separate Drupal site, download the Remote Transliteration project and enable the Remote Transliteration Client module. Go to Administration > Configuration > Regional and language > Remote Transliteration Client settings, and enter the URL of the server. That Drupal site should now be using the remote site to get transliteration data.

The modules are still pretty rough and could use some more configuration settings and clean-up, but at least from my own testing, this all works as planned!

Aside from hopefully being of use to the Japanese Drupal community, this has also been useful to teach myself some of the guts of Drupal 8 core.

The Boise Drupal Guy!

How to determine which language?

Posted by alpiniste on October 16, 2014 at 1:05pm

Thank you for this work!
I am new to Drupal, and I found that the filenames generated by pathauto (via transliteration) from node titles are always a Chinese (Mandarin?) Pinyin representation, and I am not quite happy with that…

But how will the transliteration module determine which language to use when interpreting the given characters? If used with pathauto, I guess an obvious way is to use the language of the node. Nevertheless, if the node is genuinely multilingual or language-neutral, that would not work. Perhaps it could even detect and guess the language from the context, though that can be seriously hard between Chinese varieties, particularly when the sentence or word is short, like the title of a node…
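
One cheap heuristic along those lines, sketched in Python under my own assumptions (the function name and return values are hypothetical): any hiragana or katakana means the text is almost certainly Japanese, while Han-only text is ambiguous and has to fall back to something else, such as the node language.

```python
def guess_language(text, fallback=None):
    """Return "ja" if the text contains kana, the fallback for Han-only text,
    and None when there are no CJK characters at all."""
    has_han = False
    for ch in text:
        cp = ord(ch)
        if 0x3040 <= cp <= 0x30FF:   # Hiragana and Katakana Unicode blocks
            return "ja"
        if 0x4E00 <= cp <= 0x9FFF:   # CJK Unified Ideographs (main block)
            has_han = True
    return fallback if has_han else None
```

A title like 新聞記事 is exactly the hard case: it contains no kana, so this check cannot tell it apart from Chinese and just returns the fallback.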

At least in my work, I've

Posted by Garrett Albright on October 22, 2014 at 4:36am

At least in my work, I've always operated under the assumption that the transliteration should use Japanese readings, and that there would never be a case where a Pinyin transliteration would be wanted. Yes, things would be more difficult on a multilingual site where both romaji and Pinyin transliterations may be wanted depending on context…

The Boise Drupal Guy!

Looking around for a solution

Posted by ultrabob on October 29, 2020 at 4:22am

Looking around for a solution to this in 2020, and either I found this post before I found the current solution, or this still seems to be an unsolved problem. Is there a better solution now that I'm not aware of? If not, I wonder if we could move the sandbox project forward, maybe get Composer support in place to ease the requirements. By the way, I'm an Idahoan and a long-time resident of Japan, so I guess we have a lot in common.

Reviving this old thread —

Posted by u7aro on July 21, 2026 at 8:53am

Reviving this old thread — this problem finally has a maintained contrib solution.

For anyone who lands here (like ultrabob did in 2020) still wondering whether Drupal's core transliteration ever got better for Japanese: it hasn't out of the box (日本語 still becomes "ribenyu"), but I've published a module that fixes it.

Japanese Transliteration: https://www.drupal.org/project/japanese_transliteration

It follows the same direction Garrett laid out back in 2013. Rather than mapping character-by-character, it analyzes Japanese text into morphemes and romanizes each morpheme's actual reading, then hands the result to Drupal's core transliteration service. Because it hooks the core service, everything that romanizes through it — machine names, file names, Pathauto URL aliases — benefits automatically with no per-feature setup.

A few notes:

Morphological analysis is done via the JNLP module, so you can choose Sudachi, MeCab, or Igo-php as the analyzer — the MeCab path picks up right where the original discussion left off.
Choice of romanization system (Nippon-shiki default, Kunrei-shiki, or Hepburn), ASCII-only output.
Configurable long-vowel handling (コーヒー → "koohii", or トウキョウ → "tokyo").
It also extends the client-side machine-name preview (Drupal 10.3+) so the correct Japanese reading shows while you type.
Chinese-language content is left untouched.
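
For readers unfamiliar with the romanization options in the notes above, here is a small sketch in Python of how the three systems differ and what two long-vowel behaviours could look like. It is an illustration under stated assumptions, not the module's code: the tables cover only a handful of kana, and the option names `repeat` and `drop` are hypothetical.

```python
# Kana whose romanization differs between the three systems.
SYSTEMS = {
    "nippon": {"シ": "si", "チ": "ti", "ツ": "tu", "フ": "hu",
               "ジ": "zi", "ヂ": "di", "ヅ": "du"},
    "kunrei": {"シ": "si", "チ": "ti", "ツ": "tu", "フ": "hu",
               "ジ": "zi", "ヂ": "zi", "ヅ": "zu"},
    "hepburn": {"シ": "shi", "チ": "chi", "ツ": "tsu", "フ": "fu",
                "ジ": "ji", "ヂ": "ji", "ヅ": "zu"},
}

# Kana that are romanized the same way in all three (demo subset only).
COMMON = {"コ": "ko", "ヒ": "hi"}

LONG_MARK = "ー"  # katakana prolonged sound mark


def romanize(katakana, system="nippon", long_vowels="repeat"):
    """Romanize katakana; long_vowels is "repeat" (double the vowel) or "drop"."""
    table = {**COMMON, **SYSTEMS[system]}
    out = ""
    for ch in katakana:
        if ch == LONG_MARK:
            if long_vowels == "repeat" and out and out[-1] in "aiueo":
                out += out[-1]  # repeat the preceding vowel
            continue  # with "drop", the mark is simply skipped
        out += table.get(ch, ch)
    return out
```

With these tables, romanize("コーヒー") gives "koohii", romanize("コーヒー", long_vowels="drop") gives "kohi", and フジ comes out as "huzi" in Nippon-shiki but "fuji" in Hepburn.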

Current release targets Drupal ^11.1. Issues and merge requests are welcome on the project's issue queue. Thanks to everyone in this thread — the 2013 discussion was genuinely useful background.

Great!

Posted by ultrabob on July 21, 2026 at 5:02pm

U7aro, this is great. I had ended up implementing something similar (I think I used the kuroshiro NPM package) but was never able to find the time to generalize it for release, and yours takes it further. Thanks for this work. I'll point the maintainers of the site that uses my solution toward your implementation.

日本コミュニティ: Drupal Japan User Group