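Japanese summary: Drupal 8's transliteration system is weak when it comes to Japanese. Kanji are transliterated using Pinyin (the Chinese pronunciation), so 「新聞記事」, for example, becomes "xinwenjishi"! I'd like to fix this with a module, but which approach is best? It's a challenge. The rest is in English… sorry.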
Drupal 8 introduces a transliteration class into core. This class is used wherever pure ASCII text is desirable; for example, the system names of new content types, or the filenames of uploaded files.
The problem is that it's very simplistic: it can pretty much only map one character in the source language to one or more ASCII characters, and each character is handled with no knowledge of the characters around it. Thus, it doesn't work well for a language where 新 can be "atara" or "ara" or "shin" depending on the surrounding characters. In fact, the core system as it stands doesn't even try to use Japanese pronunciation for Han characters; instead, it just uses the Pinyin (Chinese) transliteration, even if you tell Drupal to use Japanese. If you try to create a content type titled 新聞記事, Drupal will give it a system name of "xinwenjishi". (It's not great with kana either; "ちょっと待って" becomes "chi_yo_tsutodai_tsute". Wow.) You can test this yourself by running print Drupal::service("transliteration")->transliterate("日本語", "ja"); from Drush or some other method; the result is "ribenyu"!
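To make the "no context" problem concrete, here is a minimal Python sketch (not Drupal's actual code) comparing a per-character lookup with one that looks at neighbouring kana. The tiny tables and the function names per_character and context_aware are made up for illustration, and the consonant doubling is a simplification of real Hepburn rules. Kanji are passed through untouched, since they would need a real glosser.

```python
# Toy kana table; a real one would cover the whole syllabary.
BASE = {"ち": "chi", "と": "to", "て": "te", "ま": "ma", "つ": "tsu", "よ": "yo"}
SMALL_Y = {"ゃ": "ya", "ゅ": "yu", "ょ": "yo"}  # small kana that fuse with the previous syllable
SOKUON = "っ"  # small tsu: doubles the consonant that follows it


def per_character(text: str) -> str:
    """Context-free: every character is romanized in isolation."""
    table = {**BASE, **SMALL_Y, SOKUON: "tsu"}
    return "".join(table.get(ch, ch) for ch in text)


def context_aware(text: str) -> str:
    """Looks at the neighbouring character to handle small kana."""
    out = []
    double_next = False
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == SOKUON:
            # Remember to double the next consonant instead of emitting "tsu".
            double_next = True
            i += 1
            continue
        roma = BASE.get(ch) or SMALL_Y.get(ch) or ch
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch in BASE and roma.endswith("i") and nxt in SMALL_Y:
            # chi + small yo -> cho; other i-row kana would keep the "y" glide.
            stem, glide = roma[:-1], SMALL_Y[nxt]
            roma = stem + (glide[1:] if stem in ("ch", "sh", "j") else glide)
            i += 1  # the small kana has been consumed
        if double_next:
            first = roma[0]
            if first.isascii() and first.isalpha() and first not in "aeiou":
                roma = first + roma  # simplified doubling, not strict Hepburn
            double_next = False
        out.append(roma)
        i += 1
    return "".join(out)


print(per_character("ちょっとまって"))
print(context_aware("ちょっとまって"))
```

The first call strings the syllables together one by one, while the second produces "chottomatte". Kanji are a much harder version of the same problem, because the right reading depends on whole words rather than on one neighbouring character.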
I've been pondering ways to improve this with a contributed module. My first thought was to implement a glosser in PHP using a freely available dictionary such as JMDict. But this was going to be a lot of work and imprecise, and my initial tests were quite slow.
I asked Jim Breen, one of the original creators of the JMDict project and probably one of the most experienced people in the field of Japanese translation and transliteration, whether I could have access to one of the specialized dictionaries he uses to gloss Japanese quite well on his own site, and whether he had any general advice for this sort of job. He pointed me towards a project called MeCab, a glossing C library, which you can play with here. (It seems to be pretty solid, except if you do 日本人 or アメリカ人 or any other country-人: it seems to always transliterate the 人 as ニン.) Of course, the problem with this is that we can't expect site admins to have the know-how or ability to compile or install C libraries on their systems. So we'd either have to rewrite it in PHP, or…
Or I could set up a server that uses MeCab to do nothing but serve transliterations of Japanese text in response to REST queries. The contributed module could just call that server in the back end and return the results. We could set it up on a VPS in Japan so that the audience most likely to use it gets the fastest responses. Maybe a couple of servers, so they could fall back to each other or round-robin. Other OSS projects with similar needs could use it too. The source code to set up this server would also be freely available, so organizations that are leery about sending things out to external servers could just set up a server inside their firewall, or maybe even on the same machine as the web server, with the two talking over a Unix socket. Though I don't think this solution is ideal (more moving parts), it might end up being faster to develop and provide higher-quality results than implementing our own glosser in PHP.
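As a rough sketch of the client side of that idea, the Python below tries a list of servers in order and falls back to a local transliteration if none of them answers. Everything specific here is an assumption for illustration: the /transliterate path, the text and langcode query parameters, the {"result": ...} JSON shape, and the function names are hypothetical and do not describe any existing service.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Sequence


def http_fetch(base_url: str, text: str, langcode: str = "ja", timeout: float = 2.0) -> str:
    """Ask one server for a transliteration (hypothetical endpoint and JSON shape)."""
    query = urllib.parse.urlencode({"text": text, "langcode": langcode})
    with urllib.request.urlopen(f"{base_url}/transliterate?{query}", timeout=timeout) as resp:
        return json.load(resp)["result"]


def transliterate_remote(
    text: str,
    servers: Sequence[str],
    local_fallback: Callable[[str], str],
    fetch: Callable[[str, str], str] = http_fetch,
) -> str:
    """Try each server in turn; use the local transliteration if all of them fail."""
    for base_url in servers:
        try:
            return fetch(base_url, text)
        except (OSError, ValueError, KeyError):
            # Network error, malformed JSON, or unexpected payload: try the next server.
            continue
    return local_fallback(text)
```

Passing fetch in as a parameter keeps the fallback logic testable without a network, and a short timeout matters because a slow transliteration server would otherwise stall every form submission that needs a machine name.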
Any thoughts?

Comments
No comments, but I'm glad to see you're on it! Good luck, and please keep us posted on your progress.
Mr. Garett, thank you for your useful proposal.
It is very nice that we can have a transliteration function for Japanese in core.
We have discussed the transliteration module here many times in the past.
Our team (ANNAI) developed its own modules and published them as sandbox projects.
(But they are not sufficiently maintained.)
Please check the links below.
(If there are mistakes in my English and you cannot understand something, please let me know.)
☆jp_kakasi_transliteration
https://drupal.org/sandbox/qchan/1324644
☆jp_mecab_transliteration
https://drupal.org/sandbox/qchan/1324666
MeCab is fast, and it may also become usable with Apache Solr in the future. So I recommend MeCab if you can set up the operating environment for it.
If you can set up a VPS, it is easy to use, because Debian-like systems and Red Hat clones provide a package for it.
It is difficult to install on typical shared hosting servers, so we also prepared a module that uses the KAKASI library.
Compared to MeCab, this library is easier to set up, and it is usually already installed on typical shared hosting in Japan.
* Shared hosting servers where KAKASI is available:
Operating Experience/Hosting Server
http://pukiwiki.sourceforge.jp/?%E5%8B%95%E4%BD%9C%E5%AE%9F%E7%B8%BE%2F%...
Whether we choose MeCab or KAKASI, we have to find a way to hook it into Drupal, and we may need to turn it into a web service in the future, as you said.
I have not yet been able to convert these modules into Drupal 8 classes, but it would be very nice if, with your help, we could use a transliteration function that supports Japanese in Drupal core.
I'd like to keep discussing this matter.
qchan, thanks for the information. I probably should have looked to see what others in the community had tried before starting this, including searching in Japanese… My mistake.
The code you linked to will be useful for inspiration as I go forward on this. Now, I'm thinking this project can look like the following:
It will still be a lot of work, though having the MeCab PHP library to work with definitely makes things easier. Fortunately, we still have a lot of time before the Drupal 8 release.
Life has been busy recently, but I'm going to try to get the code good enough to start a sandbox project of my own by the end of next weekend.
The Boise Drupal Guy!
Okay, I've pushed some working code into a sandbox repository. It's pretty terrible right now, but it works. It at least proves that the idea is sound.
I'll keep working at it. I'm learning a lot of new things about Drupal 8 in the process.
The Boise Drupal Guy!
Here's a progress report so far.
As mentioned above, I've created a Japanese Transliteration sandbox project which features two modules: Japanese Transliteration, and Japanese Transliteration: MeCab. The former does kana transliteration, while the latter does the actual kanji glossing and could easily be swapped out for one that works with KAKASI or another system. If you're interested in building one, just copy what the MeCab module is doing and away you go.
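The split between the two modules can be pictured as a two-stage pipeline: a swappable backend turns kanji into kana, and a fixed second stage turns kana into romaji. The Python below is only a conceptual sketch of that shape, not the sandbox code. KanjiGlosser, DictionaryGlosser, and the tiny tables are hypothetical stand-ins, with a hard-coded dictionary playing the role that MeCab or KAKASI would play.

```python
from typing import Protocol


class KanjiGlosser(Protocol):
    """Anything that can rewrite kanji as kana can serve as a backend."""

    def to_kana(self, text: str) -> str: ...


class DictionaryGlosser:
    """Stand-in backend (hypothetical): replaces known words from a small table."""

    def __init__(self, readings: dict[str, str]) -> None:
        # Longest words first, so a compound wins over its own prefix.
        self._readings = sorted(readings.items(), key=lambda kv: len(kv[0]), reverse=True)

    def to_kana(self, text: str) -> str:
        for word, kana in self._readings:
            text = text.replace(word, kana)
        return text


# Toy kana table covering only the demo word.
KANA_TO_ROMAJI = {"に": "ni", "ほ": "ho", "ん": "n", "ご": "go"}


def kana_to_romaji(text: str) -> str:
    """Second stage: per-character kana lookup (unknown characters pass through)."""
    return "".join(KANA_TO_ROMAJI.get(ch, ch) for ch in text)


def transliterate(text: str, glosser: KanjiGlosser) -> str:
    """Gloss kanji to kana with the chosen backend, then romanize the kana."""
    return kana_to_romaji(glosser.to_kana(text))


glosser = DictionaryGlosser({"日本語": "にほんご", "日本": "にほん"})
print(transliterate("日本語", glosser))
```

Swapping in a different backend then only means providing another object with a to_kana method; the kana stage does not need to change.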
I've also created a Remote Transliteration project which takes the client/server idea discussed above and generalizes it so it doesn't depend on Japanese Transliteration and could be used with any site doing transliteration in any language. The project contains separate Server and Client modules.
So to set up a server for Japanese transliteration, download the Japanese Transliteration project and enable the Japanese Transliteration and Japanese Transliteration: MeCab modules. Then download the Remote Transliteration project and enable the Remote Transliteration Server module. None of these three should need any configuration to start working. Then, to set up a client on a separate Drupal site, download the Remote Transliteration project and enable the Remote Transliteration Client module. Go to Administration > Configuration > Regional and language > Remote Transliteration Client settings, and enter the URL of the server. That Drupal site should now be using the remote site to get transliteration data.
The modules are still pretty rough and could use some more configuration settings and clean-up, but at least from my own testing, this all works as planned!
Aside from hopefully being of use to the Japanese Drupal community, this has also been useful to teach myself some of the guts of Drupal 8 core.
The Boise Drupal Guy!
How to determine which language?
Thank you for this work!
I am new to Drupal, and I found that the filenames generated by pathauto (via transliteration) from node titles are always a Chinese (Mandarin?) Pinyin representation, and I am not quite happy with that…
But how will the transliteration module determine which language to use when interpreting the given characters? If it is used with pathauto, I guess an obvious way is to use the language of the node. However, that would not work if the node is genuinely multilingual or language neutral. Perhaps it could even detect and guess the language from the context, though that can be seriously hard between Chinese varieties, particularly when the sentence or word is short, like the title of a node…
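One possible heuristic, sketched in Python below purely as an illustration: trust the node's language when it has one, and for language-neutral nodes treat the presence of kana as a hint that the text is Japanese, otherwise falling back to the site default. The function name guess_langcode is made up; "und" and "zxx" are the codes Drupal uses for "not specified" and "not applicable". This does nothing to tell Chinese varieties apart, and a kanji-only Japanese title on a neutral node would still be ambiguous.

```python
import re

# Hiragana (U+3040-U+309F) and katakana (U+30A0-U+30FF) blocks.
KANA = re.compile(r"[\u3040-\u30ff]")

# Drupal's "not specified" and "not applicable" language codes.
NEUTRAL = {"und", "zxx"}


def guess_langcode(text: str, node_langcode: str, site_default: str) -> str:
    """Pick the language whose transliteration rules should be applied."""
    if node_langcode not in NEUTRAL:
        return node_langcode  # the node says what it is, so trust it
    if KANA.search(text):
        return "ja"  # kana is a strong sign that the text is Japanese
    return site_default  # kanji-only or other text: no safe guess


print(guess_langcode("ちょっと待って", "und", "en"))
```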
At least in my work, I've always operated under the assumption that the transliteration should follow Japanese readings; that there would never be a case where a Pinyin transliteration would be wanted. Yes, circumstances would be more difficult on a multilingual site where both Romaji and Pinyin transliterations may be wanted depending on context…
The Boise Drupal Guy!
Looking around for a solution to this in 2020, and either I found this post before I found the current solution, or this still seems to be an unsolved problem. Is there a better solution now that I'm not aware of? If not, I wonder if we could move the sandbox project forward, and maybe get some Composer support in place to ease the requirements. By the way, I'm an Idahoan and a long-time resident of Japan, so I guess we have a lot in common.