Japanese to romaji conversion in Tatoeba
I'm using KAKASI to automatically convert the Japanese sentences into romaji. But you have to know that the conversion is far from perfect.
Why can't I edit the romaji?
In the old Tatoeba, I had converted all the Japanese sentences into romaji, saved them in the database and allowed people to correct the generated romaji. But in the new version, I figured it wasn't worth it. Instead, I'm just going to provide "on the fly" conversion, so you will not be able to correct a specific romaji sentence.
The reason behind this is that there are more than 150,000 Japanese sentences. If we do the math and assume that it takes an average of 10 seconds to validate a romaji sentence (validating means reading it and correcting it if necessary), that's 1,500,000 seconds spent on validating all the romaji generated by KAKASI. That's about 416 hours... It's not that much if you have a thousand dedicated people fluent in Japanese working for you: the problem can be solved within 30 minutes. But Tatoeba doesn't have that much manpower, and it would surely take more than 416 hours to gather the necessary human resources, so we'll try to get the machine to do the work.
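The arithmetic above is easy to double-check. Here is a rough sketch in Python; the function names are made up for illustration, and the inputs are just the round figures from the paragraph above (150,000 sentences, 10 seconds each, a thousand volunteers), not measured values.

```python
# Back-of-the-envelope check of the validation effort.
# All inputs are the rough assumptions from the text, not measurements.

def validation_seconds(sentences: int, seconds_per_sentence: int) -> int:
    """Total time needed to read (and fix if needed) every romaji sentence."""
    return sentences * seconds_per_sentence


def minutes_per_person(total_seconds: int, people: int) -> float:
    """Time each person spends if the work is split evenly."""
    return total_seconds / people / 60


total = validation_seconds(150_000, 10)   # 1,500,000 seconds
hours = total / 3600                      # a bit under 417 hours
shared = minutes_per_person(total, 1000)  # 25 minutes each

print(f"{total} seconds, about {int(hours)} hours")
print(f"{shared:.0f} minutes each with 1000 people")
```

So a thousand volunteers would each need about 25 minutes, which is where the "within 30 minutes" figure comes from.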
Japanese to romaji conversion software
I haven't tried all the free software out there that allows you to convert Japanese into romaji (actually KAKASI is the only one I tried), but here's a small list. If you know of any other free software, let me know.
KAKASI : http://kakasi.namazu.org/
ChaSen : http://chasen.naist.jp/hiki/ChaSen/
I don't think they all convert exactly to romaji. Perhaps some of them only parse the Japanese text (i.e. put spaces where there can potentially be a space) and provide the hiragana. But this is really the most difficult task: putting the spaces in the right place and correctly converting the kanji into hiragana.
Anyway, I'm going to be lazy and stick with KAKASI for now, trying to improve the output it generates as much as possible.
What can be done to improve the romaji output
There is surely a better way to fix the romanization, but for now the simplest solution is to analyze the output KAKASI generates and set rules to replace the wrong romaji with the correct one. This will fix the most recurrent mistakes. For instance ではない is systematically converted into dehanai. So we just set a rule that says: replace "dehanai" with "dewa nai".
(Note : you'll have to understand regular expressions to understand what these lines mean)
Whenever you find something wrong with the romaji generated, just try to figure out what needs to be replaced by what, and let me know. I'll add the rule to the list.
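To give an idea of what such replacement rules can look like, here is a minimal sketch in Python. It is not the actual code or rule list used in Tatoeba: the names `RULES` and `fix_romaji` are hypothetical, the sample input is made up, and only the "dehanai" rule comes from the example above.

```python
import re

# Hypothetical rule list: each entry is (regular expression, replacement).
# Only the "dehanai" -> "dewa nai" rule comes from the text; new rules
# would simply be appended here as recurring mistakes are reported.
RULES = [
    (re.compile(r"dehanai"), "dewa nai"),
]


def fix_romaji(romaji: str) -> str:
    """Apply every replacement rule, in order, to a romaji string."""
    for pattern, replacement in RULES:
        romaji = pattern.sub(replacement, romaji)
    return romaji


# Made-up sample input, only to show the rule being applied.
print(fix_romaji("gakusei dehanai"))
```

Rules are applied in list order, so a later rule sees the result of the earlier ones. That matters as soon as two rules can match overlapping text.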
NB : You may want to know what romanization rules are used in Tatoeba.
