Sunday, December 13, 2009

Tatoeba update (Dec 12th, 2009)

This may be the last "big" update for 2009. There aren't really any new features, unlike the release we did a few weeks ago, but there are some important changes.

Creative Commons Attribution license

Tatoeba is a project that collects sentences, and we are very nice so we redistribute those sentences for free to the rest of the world -- and that's over 300,000 sentences.

These sentences do not come from nowhere. They are based on the Tanaka Corpus, a corpus compiled by Professor Yasuhito Tanaka at Hyogo University in Japan. Since Professor Tanaka released his data into the public domain, I thought I would leave Tatoeba's data in the public domain as well, which is what I did until recently... But guess what, I'm not allowed to do that. I don't want to get into details, but as a French citizen, I cannot legally put my own work into the public domain (just like any other French or European contributor to Tatoeba). That's just how the law is (cf. Wikipedia for those who can read French).

Now there is something called CC0 that could potentially be used to get closer to the public domain. But my team and I are not specialists on this topic, and we are not sure whether it is (legally) safe for us to use this license. So until we know more, we will redistribute the data under CC-BY. If this is a real problem for you, please let us know. Not that I will be able to find a solution to your problem, but at least I will be aware of what kinds of problems the CC-BY license can cause.

Tatoeba, new home for the Tanaka Corpus

For those of you who are learning Japanese, you have certainly (at least) heard of Jim Breen's WWWJDIC. The Tanaka Corpus was initially edited and maintained over there, and most of that work was done by Paul Blay. But around November or December last year (2008) he decided to pull out of this task.

Paul Blay was also an active contributor to Tatoeba, and we made sure that the content in Tatoeba related to the Tanaka Corpus was synchronised with WWWJDIC's version.

Ever since Paul Blay left, there hasn't really been any work done on the corpus. Recently, I suggested to Jim Breen that Tatoeba be used as the new platform to maintain the corpus, and he agreed. So I can at least announce to those who are willing to improve the Tanaka Corpus: Tatoeba is now the place to go.

ISO 639 alpha-3

This isn't going to have much impact on you unless you are planning to use Tatoeba's data: we're updating the language codes to ISO 639 alpha-3.
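For those who do use the data, the change mostly comes down to a lookup from the old codes to three-letter ones. Here is a minimal sketch in Python, assuming the old codes were two-letter ISO 639-1 codes (this post doesn't say so) and covering only a handful of languages. The names `ALPHA2_TO_ALPHA3` and `to_alpha3` are made up for illustration.

```python
# Hypothetical mapping from two-letter (ISO 639-1) codes to three-letter
# ISO 639-3 codes. Only a few languages are listed, for illustration.
ALPHA2_TO_ALPHA3 = {
    "en": "eng",
    "ja": "jpn",
    "fr": "fra",
    "es": "spa",
    "de": "deu",
}


def to_alpha3(code: str) -> str:
    """Return the three-letter code; codes that already have three letters pass through."""
    code = code.lower()
    if len(code) == 3:
        return code
    # Raises KeyError for codes that are not in the (incomplete) table.
    return ALPHA2_TO_ALPHA3[code]


print(to_alpha3("ja"))  # jpn
```

A script that reads the downloadable files could run every language field through a function like this, so that it accepts both the old and the new codes.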

What next?

One thing is for sure, we won't be introducing big new features until after mid-January. Most of the members of the team are still students, and will soon reach the (so-much-loved) end of their semester (me included). We won't have the time to test and debug properly before our final exams are over.

As for what exactly we are planning, I'll write more about this in a future post.

Saturday, November 28, 2009

How can Tatoeba be useful for language learners?

In response to byzantinist who asked on Twitter :
How can Tatoeba be useful for language learners, other than learning by adding sentences or using the sentences in SRS?
Honestly, in the current state of the project, there isn't much more than that.
There are no grammar explanations, there are no lessons, there aren't many people to ask for help, there's not much data in languages other than English and Japanese, and the data is not even always reliable... Generally speaking, you can use Tatoeba as a complement to your language learning, but by itself there is no way it will teach you a language. (Yeah, I know, I'm not very good at marketing :D)

But the project has not reached its full potential yet. My team and I have a lot of ambition, we just don't have a lot of free time...

To give you a better idea of the context, when I started the project I was very frustrated with online dictionaries and I had this vision of a "dictionary" in which whatever you are searching for, it will always provide you with results. Most importantly, it will always provide you with example sentences (and their translations). That would be useful, right?
And I wondered: why can someone make a collaborative encyclopedia (cf. Wikipedia), but no one is trying to make a collaborative "dictionary"? Because obviously you can't build a "dictionary" like this unless you have at least thousands of people who support your vision and are willing to contribute.
To make a long story short, I spent a few years building the tool I envisioned (at least its core). Lately some other people got involved and are helping me make this project grow faster. And now, we're reaching a phase where we really need a community, and preferably a smart one... I mean, if we cannot gather a bunch of dedicated and knowledgeable people to participate, then the project is going to remain useless despite its huge potential.

Basically, at the moment, the project is focused on building a community and gathering/organizing data, more than on integrating language learning features (although we would love to). Because there is no point building a language learning application if you don't have any good data to use... And the sad truth is that it's kind of difficult to find good data. So the concept is : we gather a lot of data, try to organize it, ensure it is of good quality and make it freely accessible, downloadable and redistributable, so that anyone who has a great idea for a language learning application (or a language tool) can just focus on coding the application and rely on us to provide data of excellent quality. And again, all of this is only possible with a strong community...

However, if you have any specific ideas for features that would make your language learning experience better, just let us know and we'll be glad to work on them. I mean, the project is constantly evolving. As far as we're concerned, we're not going to stop integrating new features anytime soon.

Sunday, November 8, 2009

New release!

There have been a lot of things going on in Tatoeba over the last couple of months. I never took the time to blog about them, but since we are planning a new release for next week, perhaps it's time I communicated a bit more...

First of all, note that I said "we are planning" and not "I am planning". Indeed, in September (and even a few months before that) the project took on another dimension, for I was not alone anymore!
Of course, I never really was alone, but I was - until then - pretty much the only one to implement the next new features, the next improvements, the next design, and also the only one to be officially responsible for promoting the project (which is something I had absolutely no time for, actually)...
But now there are 5 of us (brilliant, dedicated, innovative and incredibly charismatic young geeks! *hum* yeah). Needless to say, I'm quite happy with this new team :)

So what are we planning for this release?
  • A new design (prettier than the current one - hopefully :P).
  • The "list" feature, which will enable users to create lists of sentences.
  • User profiles, for those who would like to tell more about themselves.
  • Private messages, so that users can contact each other.
  • And some sort of "Wall" (cf. Facebook), to develop the community aspect of the project.
We will then start promoting the project more intensively, because this project NEEDS a community. There is SO MUCH that can be done to provide linguistic data and language tools that are free and of good quality. As far as we're concerned, we're digging into the technical side and trying to set up the "ideal" system for this purpose. But obviously, without a strong community, nothing can be achieved. So we do hope that more and more people will join the cause.

On a side note, we also participated in the first "innovation competition" organized by the Université de Technologie de Compiègne (where three of us, myself included, are currently studying). And we were selected in the first round (the second round is in mid-January).

Saturday, July 25, 2009

Activities for language students

A few months ago I wrote about the idea of involving language teachers in the project. Before I illustrate a few ideas of how things can work, I'd like to answer a question that you may be asking yourself (if you are a language teacher): why should I participate?

It's simple : Tatoeba can become a very useful resource... if only it had more data. The only way to gather the necessary data is to have a community working on it. You can be part of this community.
The data can be used by people who simply need to translate something, but also by programmers who would like to code language learning applications, as well as by researchers working on problems related to language processing.
I'd also like to mention here that Tatoeba is quite an ambitious project, and what's more, a non-profit one. We really need as much help as possible.

Now, if you're at least a bit convinced, what can you do to help?

Translating any sentences

The most basic assignment would be to ask each student to translate a certain number of sentences every week. Since there are a lot more English and Japanese sentences, these sentences would usually be the "source" sentences. So if you're a French teacher in Japan, you would get your students to translate Japanese sentences into French. If you're an English teacher in France, you would get your students to translate English sentences into French. And so on. I'm focusing on French because it is my goal to translate all the English/Japanese sentences into French. But you could as well be an English teacher in Spain, and ask your students to translate English sentences into Spanish.

The only work you would have to do as a teacher is to check that everyone has contributed the number of sentences you've required them to translate. The result is, even if you only have 20 students, and they each translate only 10 sentences each week for 10 weeks, that would still be 2000 additional sentences in the end.

However, this type of activity may not be very connected to the things you are teaching during your classes. Its only advantage is that it will help quickly increase the amount of data.

Translating sentences with specific words

Instead of translating just any sentence, you could ask your students to translate sentences with specific words in them. These words would of course be part of the vocabulary that you would like your students to learn.

This activity would be more useful in an academic context because the data can be used for studying. Students can retrieve the sentences they have translated, or that their peers have translated, in order to review vocabulary before a test.

Adding sentences with new vocabulary

What if there are no sentences with the vocabulary you'd like your students to learn? You can always ask them to find a sentence somewhere on the Web, add it in Tatoeba, and translate it.

Teaming up with another teacher

The problem with asking students to translate sentences is that they are likely to make mistakes when they don't translate into their mother tongue. Ideally, when a student translates into the language they are learning, a native speaker should check that the translation actually means something.

I think the best way to do this is to team up with another teacher. If you were an English teacher in France, you would team up with a French teacher in the U.K., for instance. Your students would be checking French translations added by their British partners, and the British students would be checking the English translations added by your students. In the process they can help each other by explaining why something is wrong and what would be better (provided the students have a decent enough level in the language they are learning to communicate with each other).

So these are only a few ideas, and of course we could always come up with more sophisticated things.

Anyway, if by any chance you are a language teacher and happen to be reading this, contact me!

Saturday, April 4, 2009

Collaboration with language teachers

My plan for this year 2009 is to find a few language teachers who would be interested in collaborating with Tatoeba Project in order to help increase the quantity and the quality of the English->French and Japanese->French translations.

Teachers have students, and they have authority over these students. Students either want to or have to learn the foreign language their teacher is teaching, and part of the learning process consists of doing translations. If this translation activity could take place on Tatoeba, it could greatly increase the data! I know that students will not contribute on their own initiative. It has to be either homework, or something to be done during class. That's why I need the collaboration of teachers.

For the record, there are currently over 150,000 sentences in English and Japanese, and about 24,000 French translations for these 150,000 sentences. Therefore, there are two ways to increase the number of French translations:
  1. Add an English or Japanese sentence, and translate it into French (which would have been the only way if all the English and Japanese sentences had a French translation).
  2. Translate an already existing English or Japanese sentence into French.
The good thing about the second way is that the contributors won't need to think of what sentences to add. However, it requires that they have a minimum of knowledge of the source language, and that the original sentences don't have any mistakes, or it could confuse them.

This means that I will need the help of :
  • English<->French teachers : to bring French translations for the English sentences, and correct the English and French sentences that have mistakes.
  • Japanese<->French teachers : to bring French translations for the Japanese sentences, and correct the Japanese and French sentences that have mistakes.
  • English<->Japanese teachers : to correct the Japanese and English sentences that have mistakes, and bring sentences that have new vocabulary.
I have described in another post how exactly things could work.

Friday, February 27, 2009

Rules for romaji in Tatoeba

(Thank you Luis for translating this into Spanish)

In general
In general, you should follow this chart.
  • ふ => fu (not hu)
  • づ => dzu (not zu), to differentiate it from ず, which becomes 'zu'

About particles
  • は => wa (not ha)
  • を => o (not wo)
  • へ => e (not he)

About ん
  • ん(n') when it is followed by a vowel. けんい(ken'i).
  • ん(n) in all other cases. こんにちは(konnichiwa).

About ~おう
  • No matter the situation, ~おう always becomes ~ou: 東京(toukyou), だろう(darou). Not Tôkyô or darô.
  • I would like to avoid accents, since they are quite inconvenient for those who don't have this kind of accent on their keyboard.

About katakana and capital letters
  • Katakana is always written in capital letters. パソコン(PASOKON).
  • Capital letters are used only for katakana, which means you should not use a capital letter at the beginning of a sentence.

About ー in katakana
  • Don't repeat the vowel, use -. ゲーム(GE-MU).

About ティ, ディ
  • ティ(TI). パーティ(PA-TI).
  • Same thing with ディ. ヂ(JI), ディ(DI).

About spacing
  • Spacing is quite annoying, and so far we don't have rules for everything. Unless we indicate otherwise, do whatever you think is best.
  • Always leave a space after a -te verb. 食べています(tabete imasu), やってみて(yatte mite), 愛してる(aishite ru).
  • No space in na adjectives. 上手な(jouzuna), ばかな(bakana), 本当な(hontouna)
  • Leave a space before に when it follows an adjective : 上手に(jouzu ni), 本当に(hontou ni)
  • No space between the masu form and its stem : "wakarimasu", and not "wakari masu"
  • We don't have rules yet for cases where two particles are joined. We generally don't use spaces for "noni", "node", "demo", but we do put a space in "ni wa", "de wa" (except dewa nai), "ni mo", etc.
  • Space before -にくい, -やすい, -ながら, -つづける, etc.

About quotes
  • We don't use 「」 in romaji, but double quotes.
  • For instance : 「何時ですか」「10時半です」= "nanji desu ka" "juuji han desu"

Honorific prefixes お and ご
You have to add a hyphen after "o" or "go".
  • お誕生日 : o-tanjoubi
  • ご紹介 : go-shoukai

Rules for romaji in Tatoeba

The rules in general
In general, you would follow this chart.
Exceptions : 
  • ふ => fu (and not hu)
  • づ => dzu (and not zu) to differentiate it from ず which is already converted as 'zu'

About particles
  • は => wa (and not ha)
  • を => o (and not wo)
  • へ => e (and not he)

About the ん
  • ん(n') when followed by a vowel. けんい(ken'i).
  • ん(n) otherwise. こんにちは(konnichiwa).

About the ~おう
  • No matter the situation, ~おう is converted as ~ou. 東京(toukyou), だろう(darou). Not Tôkyô or darô.
  • I would like to avoid accents because it's really not practical for those who don't have this accent on their keyboard.

About katakana and capital letters
  • Katakana always in capital letters. パソコン(PASOKON).
  • And capital letters ONLY for katakana! Which means you won't use a capital letter at the beginning of a sentence.

About the ー in katakana
  • Don't double the vowel, use -. ゲーム(GE-MU).

About ティ, ディ
  • ティ(TI). パーティ(PA-TI).
  • Same thing with DI. ヂ(JI), ディ(DI).

About spacing
  • Spacing is very annoying, and so far we don't have rules for everything. In case we didn't indicate what to do, just do whatever you feel is right.
  • Always put a space after a -te verb. 食べています(tabete imasu), やってみて(yatte mite), 愛してる(aishite ru).
  • Na adjectives : no space. 上手な(jouzuna), ばかな(bakana), 本当な(hontouna)
  • Space before に if it's after an adjective : 上手に(jouzu ni), 本当に(hontou ni)
  • No space between the verb stem and masu : "wakarimasu", and not "wakari masu"
  • In case two particles are following each other, we do not have rules yet. I usually don't put spaces for "noni", "node", "demo". But I usually put a space for "ni wa", "de wa" (except dewa nai), "ni mo"...
  • Space before -にくい, -やすい, -ながら, -つづける, etc.

About the quotes
  • Don't use 「」 in the romaji. Use double quotes instead.
  • For instance : 「何時ですか」「10時半です」 = "nanji desu ka" "juuji han desu"

Honorific prefixes お and ご
You have to add a hyphen after "o" or "go".
  • お誕生日 : o-tanjoubi
  • ご紹介 : go-shoukai
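To make a few of these rules concrete, here is a rough sketch in Python. It is not Tatoeba's actual converter. The tiny kana table and the function name `kana_to_romaji` are made up for illustration and only cover some of the examples above (ふ, づ, ん before a vowel, ~おう, capital letters for katakana, and ー written as a hyphen). Particles and spacing are left out, because they require knowing where words begin and end.

```python
# Illustrative table only: a real converter needs the full kana chart.
KANA = {
    # hiragana (lowercase)
    "け": "ke", "い": "i", "だ": "da", "ろ": "ro", "う": "u",
    "ふ": "fu", "づ": "dzu", "ず": "zu", "ん": "n",
    # katakana (always capital letters), with ー written as a hyphen
    "パ": "PA", "ソ": "SO", "コ": "KO", "ゲ": "GE", "ム": "MU",
    "ン": "N", "ー": "-",
}
VOWELS = "aeiou"


def kana_to_romaji(text: str) -> str:
    """Convert kana to romaji using the small table above."""
    pieces = []
    for i, char in enumerate(text):
        romaji = KANA[char]  # KeyError for kana missing from the table
        # ん becomes n' when the next kana starts with a vowel.
        if char in "んン" and i + 1 < len(text):
            following = KANA[text[i + 1]]
            if following[0].lower() in VOWELS:
                romaji += "'"
        pieces.append(romaji)
    return "".join(pieces)


print(kana_to_romaji("けんい"))    # ken'i
print(kana_to_romaji("だろう"))    # darou
print(kana_to_romaji("パソコン"))  # PASOKON
print(kana_to_romaji("ゲーム"))    # GE-MU
```

Note that ~おう needs no special handling here: converting kana by kana naturally gives "ou", which is exactly what the rule asks for.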

Sunday, February 8, 2009

Learning Japanese for free on the Internet

You want to learn Japanese but...
  • you speak French (and in fact you ONLY speak French)
  • you are poor (but still rich enough to have Internet access)
  • you are motivated (ULTRA motivated)
To spare you the chore of exploring the far corners of the Web yourself in search of every website that could help you in this quest, I'm sharing my list (unsorted for now). These are exclusively sites in French. I haven't included English sites (there really are waaay more resources in English).

Clear and comprehensive site. Seems to be a good resource for beginners.

Very little content related to the Japanese language. There is an introduction to pronunciation and writing, and a few exercises, but not much more.

Focused on Japanese writing. No grammar resources.

Contains 25 lessons explaining some basics of the Japanese language. A potential resource for learning Japanese, but design-wise... hmm...

Mostly focused on vocabulary. I didn't find any grammar lessons.

Hey, this is one of the sites I used to visit when I was starting out with Japanese. I think it is rather well organized and well suited to beginners.

Few grammar explanations. Mostly explanations about writing.

There are quite a lot of grammar explanations for beginners. But it probably overlaps a lot with what can already be found on the previous sites.

Practically no grammar. Mostly explanations about Japanese pronunciation and writing.

I haven't tried signing up, but potentially interesting.

No grammar. Focused on vocabulary and writing.

Mostly vocabulary.

Some grammar basics.

Likewise, some grammar basics.

Basic notions of writing and grammar.

For learning Japanese characters.

About Japanese writing.

Vocabulary for the JLPT. Oh shoot, it's in English... Well, then again, these are only vocabulary lists.

For preparing for the JLPT.

Reviewing kanji.

Explanations of pronunciation and the basics of Japanese.

Learning the characters.

For reviewing the characters.

Learning kanji.

Grammar notions.

Writing, a little vocabulary, a few notions of Japanese.

You have to dig to find it, but the content is pretty good grammar-wise.

For those taking Japanese for their baccalauréat, there is the list of notions to master (perhaps not up to date). In any case, even for those who are not taking the bac, it still gives a guiding thread for learning!

Kanji dictionary.

Japanese dictionary.

Another dictionary.

Yet another dictionary.

This one looks very comprehensive too.

Downloadable dictionary.

Not much content yet, but the site is recent and therefore still evolving. One to watch.

Quite a lot of content.


This site is a bit hard on the eyes. But there seem to be quite a lot of grammar resources.

This one is also a bit hard on the eyes, but there are quite a lot of resources.


For those taking Japanese for their baccalauréat.

For practicing Japanese with Japanese people.

Saturday, February 7, 2009

Tools for Japanese romanization

Japanese to romaji conversion in Tatoeba

I have recently re-implemented KAKASI, a little tool that was present in the old Tatoeba and that can convert Japanese into romaji or furigana. You can find a "Romaji & Furigana" link to this converter at the bottom of the Tatoeba website, along with "Contact", "Tatoeba Blog" and "Downloads".

I'm using it to automatically convert the Japanese sentences into romaji. But you have to know that the conversion is far from perfect.

Why can't I edit the romaji?

In the old Tatoeba, I had converted all the Japanese sentences into romaji, saved them in the database and allowed people to correct the romaji generated. But in the new version, I figured it wasn't worth it. Instead I'm just going to provide "on the fly" conversion, so you will not be able to correct a specific romaji sentence.

The reason behind this is that there are more than 150,000 Japanese sentences. If we do the math and assume that it takes an average of 10 seconds to validate a romaji sentence (validating means reading it and correcting it if necessary), that's 1,500,000 seconds spent on validating all the romaji generated by KAKASI. That's about 416 hours... It's not that much if you have a thousand dedicated people fluent in Japanese working for you: the problem can be solved within 30 minutes. But Tatoeba doesn't have that much manpower, and it would surely take more than 416 hours to gather the necessary human resources, so we'll try to get the machine to do the work.
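For those who want to check the math, here it is as a few lines of Python. The 10 seconds per sentence is only my assumed average, and the variable and function names are just for this example.

```python
SENTENCES = 150_000
SECONDS_PER_SENTENCE = 10  # assumed average: read + correct if necessary

total_seconds = SENTENCES * SECONDS_PER_SENTENCE
total_hours = total_seconds / 3600


def minutes_per_person(people: int) -> float:
    """Time each person spends if the work is split evenly."""
    return total_seconds / people / 60


print(total_seconds)             # 1500000
print(round(total_hours, 1))     # 416.7
print(minutes_per_person(1000))  # 25.0
```

So a thousand volunteers would each need about 25 minutes, which is where the "within 30 minutes" comes from.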

Japanese to romaji conversion software

I haven't tried all the free software out there that allows you to convert Japanese into romaji (actually KAKASI is the only one I tried), but here's a small list. If you know any other free software, let me know.

I don't think they all convert exactly to romaji. Perhaps some of them only parse the Japanese text (i.e. put spaces where there could potentially be a space) and provide the hiragana. But this is really the most difficult task : putting the spaces in the right place and correctly converting the kanji into hiragana.

Anyway, I'm going to be lazy and stick with KAKASI for now, trying to improve the output it generates as much as possible.

What can be done to improve the romaji output

Surely there is a better way to fix the romanization, but for now the simplest solution is to analyze the output KAKASI generates, and set rules to replace the wrong romaji with the correct one. This will fix the most frequent mistakes. For instance, ではない is systematically converted into dehanai. So we just set a rule that says : replace "dehanai" with "dewa nai".

The whole list of rules can be found here : 
(Note : you'll have to understand regular expressions to understand what these lines mean)

Whenever you find something wrong with the romaji generated, just try to figure out what needs to be replaced by what, and let me know. I'll add the rule to the list.
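To give an idea of what such rules look like, here is a minimal sketch in Python. It is not the actual rule list: only the "dehanai" rule comes from this post. The "niha" rule is a hypothetical example of what another rule could look like (assuming KAKASI writes には as "niha"), and the names `RULES` and `fix_romaji` are made up.

```python
import re

# (pattern, replacement) pairs, applied in order to the raw romaji.
RULES = [
    (re.compile(r"dehanai"), "dewa nai"),  # from the post: ではない
    (re.compile(r"\bniha\b"), "ni wa"),    # hypothetical extra rule: には
]


def fix_romaji(raw: str) -> str:
    """Apply every replacement rule to the generated romaji."""
    for pattern, replacement in RULES:
        raw = pattern.sub(replacement, raw)
    return raw


print(fix_romaji("kore dehanai"))   # kore dewa nai
print(fix_romaji("koko niha nai"))  # koko ni wa nai
```

The order of the rules matters: a specific rule like "dehanai" should run before any more general rule that could also match part of it.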

NB : You may want to know what romanization rules are used in Tatoeba.

Saturday, January 31, 2009

New address :

Tatoeba moved to another server, the old one being very unreliable lately... In the process, the official address became 

The other one,, still works of course. But it will redirect you to the French version of the website.

Saturday, January 24, 2009

New validation system


There are currently more than 330,000 sentences in Tatoeba (all languages included). Most of them come from a Japanese-English corpus called the Tanaka Corpus. Part of this corpus was translated into French about a year and a half ago, thanks to the initiative of Tokidoki's webmaster, who later gave me these translations to integrate into Tatoeba.

We now have about 150,000 sentences in English, roughly the same number in Japanese, and almost 24,000 in French.

The problem is that many of these sentences still contain mistakes. And to understand why, you have to understand how these sentences were collected.

Tanaka Corpus

For those who haven't read the page about the Tanaka Corpus, or who don't speak English well enough, here is the explanation (and a quick translation) :
Professor Tanaka's students were each given the task of collecting 300 sentence pairs. After several years, 212,000 pairs had been collected.


The original collection contained numerous errors, in both Japanese and English. Many of these errors were spelling and transcription mistakes, although in a significant number of cases the Japanese and English sentences contained grammatical, syntactic, etc. errors, or the translations did not match at all.
A huge amount of work has been done to maintain this corpus, and it was done mainly by one man (Paul Blay). He could not be expected to eliminate all the mistakes.

French translations

The French translations I received were the result of the work of 80 volunteers. The idea of this translation project was to first translate as many sentences as possible, even if they were not always correct, and only later go through a verification phase. The project stopped after a short time, however, and the sentences that had already been translated never got the chance to be verified.

Old validation system

In the old version of Tatoeba, new contributions were not added directly to the rest of the collection. Instead, they were added to a waiting list. Moderators could access this list, validate the correct contributions, and reject those that were not. The goal was to avoid increasing the number of incorrect sentences or translations.

But unless you have a solid group of dedicated and qualified moderators, this kind of system is clearly very slow and very heavy.

New validation system

In the new validation system, there are no more moderators. Instead, each sentence will belong to an owner, and only the owner can modify the sentence. Contributors will be responsible for the sentences they own. If you see a mistake in a sentence that is not yours, you can post a comment about it. Of course, each user will be able to quickly access the comments written about the sentences they own.

If a user does not feel able to take on the responsibility, he or she can give up ownership of a sentence. These "orphan" sentences can be adopted by other users. Right now, I can tell you that most of the sentences are orphans, and the goal is to find them a parent.

On top of that, it will be possible for everyone to follow what other contributors are doing in Tatoeba. If some people are not doing a good job and are blocking many sentences that contain mistakes by adopting them without correcting them, it will not be difficult to take their rights away.

Thursday, January 22, 2009

New validation system


There are currently over 330,000 sentences in Tatoeba (all languages included). Most of them come from an English-Japanese corpus called the Tanaka Corpus. Part of this corpus was translated into French about a year and a half ago thanks to the initiative of Tokidoki's webmaster, who later gave me the translations to integrate into Tatoeba.

We now have about 150,000 sentences in English, about the same number in Japanese, and almost 24,000 in French.

The problem is, many of these sentences still have mistakes. And to understand why, you have to understand how those sentences were collected. 

Tanaka Corpus

For those who didn't want to read the page about the Tanaka Corpus, here's the explanation :
Professor Tanaka's students were given the task of collecting 300 sentence pairs each. After several years, 212,000 sentence pairs had been collected.
The original collection contained large numbers of errors, both in the Japanese and English. Many of the errors were in spelling and transcription, although in a significant number of cases the Japanese and English contained grammatical, syntactic, etc. errors, or the translations did not match at all.
A huge amount of work has been done to maintain this corpus, but it was done mostly by one man (Paul Blay), and you couldn't expect him to get rid of all the mistakes.

French translations

The French translations that were given to me were the result of the work of 80 volunteers. The idea of this translation project was first of all to translate as much as possible, even if the translations were not always correct, and only later go through a verification phase. The project stopped early though, and the already translated sentences didn't get to go through verification.

Old validation system

In the old version of Tatoeba, new contributions were not added directly to the rest of the sentence collection. Instead, they were added to a waiting list. Moderators could see this list, validate the sentences that were correct and reject those that were not. This was meant to prevent misspelled sentences or even wrong translations from being added.

But unless I had a bunch of devoted and very qualified moderators (which I didn't), this kind of system was clearly very slow and heavy.

New validation system

In the new validation system, there are no moderators anymore. Instead, each sentence will have an owner, and only the owner can modify the sentence. Contributors will be responsible for the sentences they own. If you see a mistake in a sentence that is not yours, you can post a comment about it. Of course, each user will be able to quickly access the comments that were posted about their sentences.

If a user doesn't feel (s)he can take the responsibility, (s)he will be able to give up ownership of a sentence. These "orphan" sentences can be adopted by other users. Right now I can tell you that most of the sentences are orphans and the goal is to find them a parent.

On top of that, it will be possible for every user to follow other users' contributions in Tatoeba. In case some people are not doing a good job and are blocking many, many sentences that have mistakes by adopting them and not correcting them, it won't be difficult to withdraw their ownership.
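The ownership rules above could be modeled roughly like this. This is only a sketch in Python: the class and method names are invented for illustration, and it says nothing about how Tatoeba actually stores sentences.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Sentence:
    text: str
    owner: Optional[str] = None  # None means the sentence is an orphan
    comments: List[str] = field(default_factory=list)

    def edit(self, user: str, new_text: str) -> None:
        # Only the owner can modify the sentence.
        if self.owner is None or self.owner != user:
            raise PermissionError("only the owner can modify this sentence")
        self.text = new_text

    def add_comment(self, user: str, message: str) -> None:
        # Anyone can point out a mistake by commenting.
        self.comments.append(f"{user}: {message}")

    def release(self, user: str) -> None:
        # The owner gives up ownership: the sentence becomes an orphan.
        if self.owner != user:
            raise PermissionError("only the owner can release this sentence")
        self.owner = None

    def adopt(self, user: str) -> None:
        # Only orphan sentences can be adopted.
        if self.owner is not None:
            raise PermissionError("this sentence already has an owner")
        self.owner = user


sentence = Sentence("I has a cat.")
sentence.adopt("alice")
sentence.add_comment("bob", "should be 'have'")
sentence.edit("alice", "I have a cat.")
print(sentence.text)  # I have a cat.
```

Withdrawing ownership from someone who blocks sentences would be one more operation on top of this, reserved for whoever administers the site.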

Monday, January 19, 2009

Unstable server

For anyone whose reflex is to come read this blog when there's a problem accessing the site: you should know that the server where Tatoeba is currently hosted is somewhat unstable.

The project should be moved to another server sometime in February.

Sunday, January 18, 2009

Better now

Everything is up again. And it's faster now.

I'm temporarily taking out the "Logs" and "Statistics" until I get to optimize these parts too.

In the process, I also tried modifying the layout a little bit so people understand that when they translate, they should base their translation on the main sentence. I'm not sure how to make it clear, but I hope having these arrows in front of each translation will do the job. I also made the "warning" icon more aggressive so that people would be more likely to read it.

It's time to optimize

Well, after having the occasion to try running the new Tatoeba in real conditions for a few hours, it turns out that it's really, really, really slow and ended up crashing the server... Not really the best time for it to happen. I guess I underestimated the consequences of caring too little about optimization.

Sorry to those who needed to search in the corpus, and those who tried to re-confirm their registration but couldn't because the server was down. I'll be more careful next time. Hopefully I can get everything fixed by the end of the weekend.

Friday, January 16, 2009

What about the other data of the old Tatoeba?

The forum

It is very unlikely that I will try to migrate the old forum posts into the new Tatoeba. First of all, there is no forum anymore in the new Tatoeba. Instead there is a "Comments" section, which lists the latest comments about the sentences.
I will set up a new forum someday, but probably not for a couple of months.


Most of the documentation is not relevant anymore in the new Tatoeba. I'll take the time to update it though. I will use this blog to store the new documentation articles.


In the new version, the sentences will be considered as added by an unknown user on the date when the migration was done. There will be no trace left of the evolution of a sentence (modifications, suggested corrections, validation).
For a few thousand sentences, I was able to retrieve the author and the date when they were added, but that's all I could do. I suppose it is enough.


The statistics are based on the logs. Let's say in the previous version you had added 5 new sentences, and translated 7 sentences. In the new version, your stats will say that you have added 12 sentences, but there will be no indication of which ones you translated. They will be counted as if you had added them as single sentences.
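
The arithmetic above can be sketched in a few lines of Python. This is just an illustration under assumptions: the log format (a list of `(user, action)` pairs) and the function name `migrate_stats` are invented for the example and are not the real migration script.

```python
from collections import Counter


def migrate_stats(old_log):
    """Count contributions per user, ignoring whether each one
    was an added sentence or a translation (hypothetical log format)."""
    added = Counter()
    for user, action in old_log:
        if action in ("add", "translate"):
            # Both kinds of contribution become a plain "added sentence".
            added[user] += 1
    return added


# 5 new sentences and 7 translations in the old version...
old_log = [("you", "add")] * 5 + [("you", "translate")] * 7

# ...show up as 12 added sentences in the new version.
print(migrate_stats(old_log)["you"])  # prints 12
```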

Sunday, January 11, 2009

New version

A new version of Tatoeba will be available soon! Optimistically, it will be online before next weekend (that is before January 16th). If not, then it will be at the end of the month.

Along with the new version, I have decided to create this blog, where I will publish information about the evolution of the project, as well as some documentation related to it. So if you are interested in what's going on, come back here once in a while. Hopefully I will be motivated enough to keep this blog up-to-date.

Features for this new version
  • Add a sentence - Well, I don't need to explain that one.
  • Translate a sentence - Quite often, you can translate the same sentence in different ways. But the current version of Tatoeba only allows you to add one translation in each language. In the new version, you will be able to add as many translations as you want, in any language you want.
  • Modify a sentence - You can only modify the sentences that you have added. If you notice a mistake in a sentence that is not yours, you will have to post a comment about it.
  • Comment on a sentence - The comments can be used for reporting a mistake, asking for an explanation, explaining in what context the sentence can be used, specifying the source of the sentence, etc.
  • Language auto-detection - This spares you the very difficult task of specifying in which language you are writing. Note : the auto-detection may not work in some cases. But the sentence will still be saved.
  • Search - I don't need to explain this either.
  • Logs and statistics - You can check out how active the community is by looking at the logs, and who the most active people are by looking at the statistics. Note that the logs and statistics will all be reset. Everyone starts from zero again. I will still make a special page in memory of those who contributed a lot to the old Tatoeba.

What next?
  • Indexed Japanese sentences : to handle the Tanaka Corpus's "B line".
  • Download sentences : because it's nice to share.
  • Mark sentences as verified : to improve the quality of the sentences.
  • Mark translations as verified : to improve the quality of the translations.
If you believe there is a feature I have not mentioned but that is more important than those listed above, let me know.

Other stuff you may want to know

I will disable the possibility to add/translate/edit sentences until the new version is up.
I will most likely switch to a new official URL for the project.