Saturday, February 7, 2009

Tools for Japanese romanization

Japanese to romaji conversion in Tatoeba

I have recently re-implemented KAKASI, a little tool that was present in the old Tatoeba and that can convert Japanese into romaji or furigana. You can find a "Romaji & Furigana" link to this converter at the bottom of the Tatoeba website, along with "Contact", "Tatoeba Blog" and "Downloads".

I'm using it to automatically convert the Japanese sentences into romaji. But you should know that the conversion is far from perfect.


Why can't I edit the romaji?

In the old Tatoeba, I had converted all the Japanese sentences into romaji, saved them in the database and allowed people to correct the generated romaji. But in the new version, I figured it wasn't worth it. Instead, I'm just going to provide "on the fly" conversion, so you will not be able to correct a specific romaji sentence.

The reason behind this is that there are more than 150,000 Japanese sentences. If we do the math and assume that it takes an average of 10 seconds to validate a romaji sentence (validating means reading it and correcting it if necessary), that's 1,500,000 seconds spent on validating all the romaji generated by KAKASI. That's about 416 hours... It's not that much if you have a thousand dedicated people fluent in Japanese working for you: the problem can be solved within 30 minutes. But Tatoeba doesn't have that much manpower, and it would surely take more than 416 hours to gather the necessary human resources, so we'll try to get the machine to do the work.
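
As a quick check of the arithmetic above, here is a small Python sketch. The function name and its parameters are only for illustration; the figures (150,000 sentences, 10 seconds each, a thousand people) are the ones assumed in the paragraph above.

```python
def validation_effort(sentences, seconds_each, people=1):
    """Return (total seconds, total hours, minutes per person)."""
    total_seconds = sentences * seconds_each
    total_hours = total_seconds / 3600
    # If the work is split evenly, each person handles an equal share.
    minutes_per_person = total_seconds / people / 60
    return total_seconds, total_hours, minutes_per_person


# Figures assumed in the text: 150,000 sentences, 10 seconds per sentence,
# and a (hypothetical) team of a thousand fluent volunteers.
total, hours, minutes = validation_effort(150_000, 10, people=1000)
print(f"{total} seconds, about {int(hours)} hours, {minutes:.0f} minutes per person")
```

With a thousand people, each person's share comes out a little under the 30 minutes mentioned above.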


Japanese to romaji conversion software

I haven't tried all the free software out there that allows you to convert Japanese into romaji (actually KAKASI is the only one I tried), but here's a small list. If you know any other free software, let me know.


I don't think they all convert exactly to romaji. Perhaps some of them only parse the Japanese text (i.e. put spaces where there could potentially be a space) and provide the hiragana. But this is really the most difficult task: putting the spaces in the right places and correctly converting the kanji into hiragana.

Anyway, I'm going to be lazy and stick with KAKASI for now, trying to improve the output it generates as much as possible.


What can be done to improve the romaji output

Surely there is a better way to fix the romanization, but for now the simplest solution is to analyze the output KAKASI generates and set rules that replace the wrong romaji with the correct one. This will fix the most frequent mistakes. For instance, ではない is systematically converted into "dehanai", so we just set a rule that says: replace "dehanai" with "dewa nai".

The whole list of rules can be found here: 
(Note: you'll need to know regular expressions to understand what these lines mean.)

Whenever you find something wrong with the romaji generated, just try to figure out what needs to be replaced by what, and let me know. I'll add the rule to the list.
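
To give an idea of what such a rule looks like, here is a minimal Python sketch. It is not the actual Tatoeba rule list: only the "dehanai" rule comes from the example above, the "noyouni" rule is an assumed extra example, and the names RULES and fix_romaji are made up for illustration.

```python
import re

# Each rule is (pattern, replacement). The \b anchors keep a rule from
# firing in the middle of a longer romaji word.
# Only the first rule is taken from the text; the second is an assumption.
RULES = [
    (re.compile(r"\bdehanai\b"), "dewa nai"),
    (re.compile(r"\bnoyouni\b"), "no youni"),
]


def fix_romaji(text):
    """Apply every replacement rule, in order, to the generated romaji."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text


print(fix_romaji("kore wa hon dehanai"))
```

Rules are applied in order, so a later rule sees the output of the earlier ones. That keeps things simple, but it also means the order matters once rules start to overlap.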

NB: You may also want to know which romanization rules are used in Tatoeba.

8 comments:

  1. Hi Trang

    Thanks for the article.

    I'm trying to run kakasi myself to do a similar thing - what parameters are you using when doing the conversion?

    Thanks
    K

  2. Gosh, sorry for the delay. I never thought people would actually post comments on this blog...

    Well, I suppose you've figured it out by now, but for the next people who stumble upon this page and wonder, I'm using these parameters for the romaji:

    -Ja -Ha -Ka -Ea -s

    And these for the furigana:

    -JH -s -f

  3. I am using Kakasi with JMDict here: http://kotoba.tremicom.com but I am concerned that it is not the most accurate tokenizer. All of the in-depth articles I have found that compare the ones mentioned here are written in Japanese, which I don't understand. If anyone can share their experience with these, it would be a big help.

  4. The main problem I have with the romanization is the odd placing of particles. The other is misleading breaks. For example, in the sentence 時々彼はまるで私の上司のように振る舞う we get "tokidoki kare wa marude watashi no joushi noyouni furu mau". That "noyouni" should be "no youni" and the "furu mau" should be "furumau", as it's treated as a single verb. I just put that sentence through MeCab and got (not in ローマ字 of course) "tokidoki kare ha marude watashi no joushi no youni furumau", which fixes all those problems. I recommend you look at using either Chasen or MeCab instead of poor old Kakasi.

    Jim

  5. Hi... Hmm...
    I made a program and would like you to add it to the list above: JapWrite (http://ahmadmdev.yolasite.com)

    For now it can only "romanize" the kana, not the kanji.

    I also have a suggestion and don't know where to put it... I'll try the latest blog post.

    Replies
    1. I'll try to do kanji, but I haven't found a way yet.

  6. Thanks for this valuable help. Maybe now I can improve my own lousy converter. Note: the sedlist link is broken.

  7. This comment has been removed by a blog administrator.
