Sunday, December 13, 2009

Tatoeba update (Dec 12th, 2009)

This may be the last "big" update for 2009. Unlike the release we did a few weeks ago, it doesn't really bring any new features, but there are some important changes.

Creative Commons Attribution license

Tatoeba is a project that collects sentences, and we are very nice so we redistribute those sentences for free to the rest of the world -- and that's over 300,000 sentences.

These sentences do not come from nowhere. They are based on the Tanaka Corpus, a corpus compiled by Professor Yasuhito Tanaka at Hyogo University in Japan. Since Professor Tanaka released his data into the public domain, I thought I would leave Tatoeba's data in the public domain as well, which is what I did until recently... But guess what, I'm not allowed to do that. I don't want to get into details, but as a French citizen, I cannot legally put my own work in the public domain (the same goes for any other French or European contributor to Tatoeba). That's just how the law is (cf. Wikipedia, for those who can read French).

Now there is something called CC0 that could potentially be used to get closer to the public domain. But my team and I are not specialists on this topic, and we are not sure whether it is (legally) safe for us to use this license. So until we know more, we will redistribute the data under CC-BY. If this is a real problem for you, please let us know. Not that I will be able to find a solution to your problem, but at least I will be aware of what kinds of problems the CC-BY license can cause.

Tatoeba, new home for the Tanaka Corpus

For those of you who are learning Japanese, you have certainly (at least) heard of Jim Breen's WWWJDIC. The Tanaka Corpus was initially edited and maintained over there, and most of that work was done by Paul Blay. But around November or December last year (2008) he decided to pull out of this task.

Paul Blay was also an active contributor to Tatoeba, and we made sure that the content in Tatoeba related to the Tanaka Corpus was synchronised with WWWJDIC's version.

Ever since Paul Blay left, there hasn't really been any work done on the corpus. Recently, I suggested to Jim Breen that we use Tatoeba as the new platform to maintain the corpus, and he agreed. So I can at least announce to those who are willing to improve the Tanaka Corpus: Tatoeba is now the place to go.

ISO 639 alpha-3

This isn't going to have much impact on you unless you are planning to use Tatoeba's data: we're updating the language codes to ISO 639 alpha-3.
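
If you do consume the data, a small lookup table is probably all you need to adapt. Here is a minimal sketch in Python, assuming your own code currently uses two-letter codes (this post does not say what the old codes were). The table and the to_alpha3 helper are made up for illustration and cover only a handful of languages, so check them against the actual export before relying on them.

```python
# Hypothetical lookup table: two-letter codes -> ISO 639 alpha-3 codes.
# Only a few languages are listed; extend it to match the data you use.
ALPHA2_TO_ALPHA3 = {
    "en": "eng",
    "fr": "fra",
    "ja": "jpn",
    "de": "deu",
    "es": "spa",
}


def to_alpha3(code):
    """Return the alpha-3 code for a two-letter code.

    Three-letter input is assumed to be alpha-3 already and is returned
    unchanged (it is not validated against the standard).
    """
    code = code.strip().lower()
    if len(code) == 3:
        return code
    try:
        return ALPHA2_TO_ALPHA3[code]
    except KeyError:
        raise ValueError(f"no alpha-3 mapping known for {code!r}")
```

Raising an error on unknown codes, rather than passing them through, makes it obvious when the table needs a new entry.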

What next?

One thing is for sure: we won't be introducing big new features until past mid-January. Most of the members of the team are still students (me included), and will soon reach the (so-much-loved) end of their semester. We won't have the time to test and debug properly before our final exams are over.

As for what exactly we are planning, I'll write more about this in a future post.