Saturday, January 31, 2009

New address: tatoeba.org

Tatoeba has moved to another server, since the old one had become very unreliable lately. In the process, the official address became http://tatoeba.org.

The old address, http://tatoeba.fr, still works of course, but it will redirect you to the French version of the website.

Saturday, January 24, 2009

New validation system

Context

There are currently more than 330,000 sentences in Tatoeba (all languages included). Most of them come from a Japanese-English corpus called the Tanaka Corpus. Part of this corpus was translated into French about a year and a half ago, thanks to the initiative of the webmaster of Tokidoki, who later gave me these translations to integrate into Tatoeba.

We now have about 150,000 sentences in English, roughly the same number in Japanese, and almost 24,000 in French.

The problem is that many of these sentences still contain mistakes. To understand why, you need to understand how these sentences were collected.


Tanaka Corpus

For those who have not read the page about the Tanaka Corpus, or who do not speak English well enough, here is the explanation (in a quick translation):
Professor Tanaka's students were given the task of collecting 300 sentence pairs each. After several years, 212,000 pairs had been collected.

[...]

The original collection contained numerous errors, both in Japanese and in English. Many of these errors were spelling and transcription mistakes, although in a significant number of cases the Japanese and English sentences contained grammatical, syntactic, etc. errors, or the translations did not match at all.
A huge amount of work has been done to maintain this corpus, and it was done mainly by one man (Paul Blay). He could not be expected to eliminate every mistake.


French translations

The French translations I received were the result of the work of 80 volunteers. The idea of this translation project was to first translate as many sentences as possible, even if the result was not always correct, and only later go through a verification phase. The project stopped after a short time, however, and the sentences that had already been translated never got the chance to be verified.


Old validation system

In the old version of Tatoeba, a new contribution was not added directly to the rest of the collection. Instead, it was added to a waiting list. Moderators could access this list, validate the correct contributions, and reject the incorrect ones. The aim was to keep the number of incorrect sentences or translations from growing.

But unless you have a solid group of dedicated and qualified moderators, this kind of system is clearly very slow and very cumbersome.


New validation system

In the new validation system, there are no more moderators. Instead, each sentence will belong to an owner, and only the owner can modify the sentence. Contributors will be responsible for the sentences they own. If you see a mistake in a sentence that is not yours, you can post a comment about it. Of course, each user will be able to quickly access the comments written about the sentences they own.

If a user does not feel able to take on the responsibility, he or she can give up ownership of a sentence. These "orphan" sentences can then be adopted by other users. Right now, I can tell you that most of the sentences are orphans, and the goal is to find them a parent.

In addition, everyone will be able to follow what other contributors are doing in Tatoeba. If some people do a poor job and block many sentences that contain mistakes by adopting them without correcting them, it will not be difficult to take their rights away.

Thursday, January 22, 2009

New validation system

Context

There are currently over 330,000 sentences in Tatoeba (all languages included). Most of them come from an English-Japanese corpus named the Tanaka Corpus. Part of this corpus was translated into French about a year and a half ago thanks to the initiative of Tokidoki's webmaster, who later gave me the translations to integrate into Tatoeba.

We now have about 150,000 sentences in English, about the same number in Japanese, and almost 24,000 in French.

The problem is, many of these sentences still have mistakes. And to understand why, you have to understand how those sentences were collected. 


Tanaka Corpus

For those who didn't want to read the page about the Tanaka Corpus, here's the explanation:
Professor Tanaka's students were given the task of collecting 300 sentence pairs each. After several years, 212,000 sentence pairs had been collected.
[...]
The original collection contained large numbers of errors, both in the Japanese and English. Many of the errors were in spelling and transcription, although in a significant number of cases the Japanese and English contained grammatical, syntactic, etc. errors, or the translations did not match at all.
A huge amount of work has been done to maintain this corpus, but it was done mostly by one man (Paul Blay), and you couldn't expect him to get rid of all the mistakes.


French translations

The French translations that were given to me were the result of the work of 80 volunteers. The idea of this translation project was first of all to translate as much as possible, even if the result was not always correct, and only later go through a phase of verification. The project stopped early though, and the sentences that had already been translated never got to go through verification.


Old validation system

In the old version of Tatoeba, a new contribution was not added directly to the rest of the sentence collection. Instead, it was added to a waiting list. Moderators could see this list, validate the sentences that were correct and refuse those that were not. The aim was to keep misspelled sentences, or even wrong translations, from piling up.

But unless I had a bunch of devoted and very qualified moderators (which I didn't), this kind of system was clearly very slow and cumbersome.


New validation system

In the new validation system, there are no moderators anymore. Instead, each sentence will have an owner, and only the owner can modify the sentence. Contributors will be responsible for the sentences they own. If you see a mistake in a sentence that is not yours, you can post a comment about it. Of course, each user will be able to quickly access the comments that were posted about their sentences.

If a user doesn't feel (s)he can take the responsibility, (s)he will be able to renounce ownership of a sentence. These "orphan" sentences can be adopted by other users. Right now I can tell you that most of the sentences are orphans, and the goal is to find them a parent.

On top of that, it will be possible for every user to follow other users' contributions in Tatoeba. In case some people are not doing a good job and are blocking many, many sentences that have mistakes by adopting them and not correcting them, it won't be difficult to withdraw their ownership.
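
To make these rules a bit more concrete, here is a minimal sketch in Python. It is only an illustration and not the actual Tatoeba code: the `Sentence` class, its field names and its methods are all hypothetical, chosen to mirror the rules described above (only the owner can edit, anyone can comment, an owner can release a sentence, and an orphan can be adopted).

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Sentence:
    """Hypothetical model of a sentence under the ownership rules."""
    text: str
    owner: Optional[str] = None  # None means the sentence is an "orphan"
    comments: List[Tuple[str, str]] = field(default_factory=list)

    def edit(self, user: str, new_text: str) -> None:
        # Only the owner can modify the sentence.
        if self.owner != user:
            raise PermissionError("only the owner can modify this sentence")
        self.text = new_text

    def comment(self, user: str, message: str) -> None:
        # Anyone can comment, for example to point out a mistake.
        self.comments.append((user, message))

    def release(self, user: str) -> None:
        # The owner gives up ownership; the sentence becomes an orphan.
        if self.owner != user:
            raise PermissionError("only the owner can release this sentence")
        self.owner = None

    def adopt(self, user: str) -> None:
        # Only orphan sentences can be adopted.
        if self.owner is not None:
            raise PermissionError("this sentence already has an owner")
        self.owner = user


# Example: an orphan sentence with a mistake gets a comment, then a parent.
s = Sentence("I has a cat.")
s.comment("alice", "It should be 'have', not 'has'.")
s.adopt("bob")
s.edit("bob", "I have a cat.")
```

In this sketch, withdrawing ownership from someone who adopts sentences without correcting them would simply mean setting `owner` back to `None`, so that the sentence can be adopted again.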

Monday, January 19, 2009

Unstable server

If your reflex is to come and read this blog when there's a problem accessing http://tatoeba.fr, you should know that the server currently hosting Tatoeba is somewhat unstable.

The project should be moved to another server sometime in February.

Sunday, January 18, 2009

Better now

Everything is up again. And it's faster now.

I'm temporarily taking out "Logs" and "Statistics" until I get around to optimizing these parts too.

In the process, I also tried modifying the layout a little so people understand that when they translate, they should base their translation on the main sentence. I'm not sure how to make it clear, but I hope having these arrows in front of each translation will do the job. I also made the "warning" icon more aggressive so that people would be more likely to read it.

It's time to optimize

Well, after having the chance to run the new Tatoeba in real conditions for a few hours, it turned out to be really, really, really slow, and it ended up crashing the server... Not really the best time for it to happen. I guess I underestimated the consequences of caring too little about optimization.

Sorry to those who needed to search in the corpus, and to those who tried to re-confirm their registration but couldn't because the server was down. I'll be more careful next time. Hopefully I can get everything fixed by the end of the weekend.

Friday, January 16, 2009

What about the other data of the old Tatoeba?

The forum

It is very unlikely that I will try to migrate the old forum posts into the new Tatoeba. First of all, there is no forum anymore in the new Tatoeba. Instead there is a "Comments" section, which lists the latest comments about the sentences.
I will set up a new forum someday, but probably not for a couple of months.


Documentation

Most of the documentation is not relevant anymore in the new Tatoeba. I'll take the time to update it though. I will use this blog to store the new documentation articles.


Logs

In the new version, the sentences will be considered as added by an unknown user on the date the migration was done. There will be no trace left of the evolution of a sentence (modifications, suggested corrections, validation).
For a few thousand sentences, I was able to retrieve the author and the date when the sentence was added, but that's all I could do. I suppose it is enough.


Statistics

The statistics are based on the logs. Let's say in the previous version you had added 5 new sentences, and translated 7 sentences. In the new version, your stats will say that you have added 12 sentences, but there will be no indication of which ones you translated. They will all be counted as if you had added them as single sentences.
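
The arithmetic of that example can be sketched in a few lines of Python. This is only an illustration under assumed names (`migrate_stats` and the dictionary keys are hypothetical, not the real migration script): the old per-action counts are collapsed into a single "added" count.

```python
def migrate_stats(old_counts):
    """Collapse the old per-action counts into a single 'added' count.

    After the migration, translations can no longer be told apart from
    plain additions, so everything is counted as an added sentence.
    """
    return {"added": sum(old_counts.values())}


# The example from the text: 5 added sentences and 7 translations.
old_counts = {"added": 5, "translated": 7}
new_counts = migrate_stats(old_counts)
print(new_counts)  # {'added': 12}
```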

Sunday, January 11, 2009

New version

A new version of Tatoeba will be available soon! Optimistically, it will be online before next weekend (that is, before January 16th). If not, then it will be at the end of the month.

Along with the new version, I have decided to create this blog, where I will publish information about the evolution of the project, as well as some documentation related to it. So if you are interested in what's going on, come back here once in a while. Hopefully I will be motivated enough to keep this blog up-to-date.


Features for this new version
  • Add a sentence - Well, I don't need to explain that one.
  • Translate a sentence - Quite often, you can translate the same sentence in different ways. But the current version of Tatoeba only allows you to add one translation in each language. In the new version, you will be able to add as many translations as you want, in any language you want.
  • Modify a sentence - You can only modify the sentences that you have added. If you notice a mistake in a sentence that is not yours, you will have to post a comment about it.
  • Comment on a sentence - The comments can be used for pointing out a mistake, asking for an explanation, explaining in what context the sentence can be used, specifying the source of the sentence, etc.
  • Language auto-detection - This spares you the very difficult task of specifying which language you are writing in. Note: the auto-detection may not work in some cases, but the sentence will still be saved.
  • Search - I don't need to explain this either.
  • Logs and statistics - You can check how active the community is by looking at the logs, and who the most active people are by looking at the statistics. Note that the logs and statistics will all be reset. Everyone starts from zero again. I will still make a special page in memory of those who contributed a lot to the old Tatoeba.

What next?
  • Indexed Japanese sentences: to handle the Tanaka Corpus's "B line".
  • Download sentences: because it's nice to share.
  • Mark sentences as verified: to improve the quality of the sentences.
  • Mark translations as verified: to improve the quality of the translations.
If you believe there is a feature I have not mentioned but that is more important than those listed above, let me know.


Other stuff you may want to know

I will disable adding, translating and editing sentences until the new version is up.
I will surely switch to tatoeba.org (instead of tatoeba.fr) as the official URL of the project.