Showing posts with label validation. Show all posts

Friday, April 30, 2010

Reliability of the sentences, how will we handle this?

Reliability has always been a big issue in Tatoeba. Many sentences have mistakes, but there is currently no indication of whether a sentence is correct or whether it is an accurate translation. So when you look at a sentence, you can never be 100% sure that you can rely on it.

We will start introducing some measures to solve this.

The first objective is to have all the sentences in Tatoeba adopted by someone. The languages most concerned are Japanese, English and French. They are the main languages in Tatoeba, and I explained where they come from in a very old post. That post also explains the idea behind "adoption" of sentences. You can also read this discussion, where I explained my point of view on the "adopt" feature in more detail.

However, adopting is not enough. Sooner or later we will need some sort of "vote system". But before integrating a new feature, we can use what we already have. This article describes how we are likely to proceed, but since this is still experimental, the procedure will not necessarily remain as described. It can be improved as we experiment with it, and all feedback is welcome.


Step 1 - Generating lists

We will generate lists, which could be named following this template:
[checking] $languageCode ($languageName), $whateverYouWant

For instance:
[checking] fra (French), list 1
These lists will be private, and each of them will be assigned to the person who has to check the sentences in that list. They will be filled first with "orphan sentences", which won't remain orphans for long because we will automatically assign them to the person in charge of the checking.
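As an illustration only, the naming template above could be filled in like this. This is a minimal Python sketch; the function name `checking_list_name` and its parameters are made up for the example and are not part of Tatoeba.

```python
def checking_list_name(language_code: str, language_name: str, label: str) -> str:
    """Fill in the '[checking] code (name), label' template (hypothetical helper)."""
    return f"[checking] {language_code} ({language_name}), {label}"


print(checking_list_name("fra", "French", "list 1"))
# prints: [checking] fra (French), list 1
```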


Step 2 - Checking

Users will have to check sentences in their native language only. Most importantly, they will NOT be checking the accuracy of translations, only the sentences themselves, i.e. whether there are any spelling or grammar mistakes. We will deal with the accuracy of translations much later.

While correcting sentences, the meaning of the sentence must never be altered.

There may be some cases where you are not sure whether the sentence should be edited or not. For instance, sentences that are archaic or sentences that are grammatically correct, but do not sound like what a native speaker would say. There is no absolute rule... Usually, if the sentence is archaic, you can leave it alone. If it is grammatically correct but not "natural", then try to make it sound natural. At any rate, if you are not sure what to do, post a comment on the sentence and we will see what should be done about it.


Step 3 - Marking sentences as checked

If you are volunteering to check sentences, you will first have to check and correct all the sentences you own (and that are in your native language), because we will consider all of these sentences as "checked", except those in your "checking" list, which are of course still being checked.

All the sentences that are not adopted or that belong to someone who is not part of the "checking team" will be considered as not checked.
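The rule above can be written down as a small predicate. This is only a sketch of the rule as described in this post, not actual Tatoeba code; the function name, its parameters, and the idea of storing the checking team as a mapping from username to native languages are assumptions made for the example.

```python
from typing import Dict, Optional, Set


def is_checked(sentence_id: int, lang: str, owner: Optional[str],
               checking_team: Dict[str, Set[str]],
               pending_ids: Set[int]) -> bool:
    """Return True if a sentence counts as "checked" under the rule above.

    checking_team maps a username to that user's native language codes
    (assumed structure); pending_ids holds the sentences that are still
    sitting in a "[checking]" list.
    """
    if owner is None:
        return False                      # orphan sentence: not checked
    native_langs = checking_team.get(owner)
    if native_langs is None:
        return False                      # owner is not in the checking team
    if lang not in native_langs:
        return False                      # not in the owner's native language
    return sentence_id not in pending_ids  # still in a checking list: not yet


team = {"alice": {"fra"}}
print(is_checked(1, "fra", "alice", team, pending_ids={2}))  # True
print(is_checked(2, "fra", "alice", team, pending_ids={2}))  # False
```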

Once you are done checking the sentences in your list, we will by default renew the list.
Alternatively, you can rename your checking list to "[checked] $lang ...", in case you want to keep track of the various batches of sentences you have checked. We will then generate a new list for you.

Anyway, as you can see, with the features we have, we can already start reviewing sentences "en masse". The problem, however, is that the list feature is not exactly optimized for this...


Step 4 - Exporting lists into CSV file

This will be a useful feature in general, but in our particular case, it will enable people to check sentences offline, as well as execute a "replace all" if they come across a recurrent mistake.

On that matter, don't hesitate to send us an email and tell us about recurrent mistakes you find and that can be corrected systematically. This will help us create a script to correct these mistakes in the non-adopted sentences so that contributors won't have to waste time working on things that can be processed automatically.
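To give an idea of what a "replace all" on an exported file could look like, here is a minimal Python sketch. The export format does not exist yet, so the two-column layout (sentence id, sentence text) is an assumption, and `replace_all` is a hypothetical helper.

```python
import csv
import io


def replace_all(csv_text: str, wrong: str, right: str) -> str:
    """Replace a recurrent mistake in every sentence of an exported list.

    Assumes each row is "id,text"; the ids are left untouched so the
    corrections can be matched back to the original sentences.
    """
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for sentence_id, text in reader:
        writer.writerow([sentence_id, text.replace(wrong, right)])
    return out.getvalue()


exported = "1,I recieve letters.\n2,Did you recieve it?\n"
print(replace_all(exported, "recieve", "receive"))
```

A plain string replacement like this is only safe for mistakes that are wrong in every context, which is why reporting such mistakes first is useful.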


Step 5 - Re-importing CSV file

People who decide to correct from the CSV file can later import their corrections back. All the corrections made in the file will be applied to the sentences in Tatoeba. At this time, it is still not very clear how the re-import process will be handled.
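Since the re-import process is not defined yet, the following is only one possible approach, sketched in Python: compare the exported file with the corrected one and keep just the sentences whose text changed. The "id,text" row layout and both function names are assumptions made for the example.

```python
import csv
import io
from typing import Dict


def read_sentences(csv_text: str) -> Dict[str, str]:
    """Map sentence id to sentence text, assuming rows of the form "id,text"."""
    return {row[0]: row[1] for row in csv.reader(io.StringIO(csv_text)) if row}


def changed_sentences(exported: str, corrected: str) -> Dict[str, str]:
    """Return only the sentences that were actually edited in the file."""
    before = read_sentences(exported)
    after = read_sentences(corrected)
    return {
        sentence_id: text
        for sentence_id, text in after.items()
        # Ignore ids that were not part of the export.
        if sentence_id in before and before[sentence_id] != text
    }


exported = "1,I recieve letters.\n2,This one is fine.\n"
corrected = "1,I receive letters.\n2,This one is fine.\n"
print(changed_sentences(exported, corrected))
# prints: {'1': 'I receive letters.'}
```

Applying only the changed rows would keep untouched sentences out of the edit history, but whether that is how the import ends up working is still open.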


Step 6 - Re-adapting the list page

Some people may prefer checking from a page in Tatoeba rather than from a file, so that they can easily post a comment when needed, unadopt the sentence and leave it to someone else if it's just too difficult for them, favorite it, or add it to another list...

But the list page is not exactly optimized for that, so we will try to provide a page where users can easily do these things while checking.


Step 7 - Reorganizing the lists

There may be a point where the lists section starts getting a bit too messy. We will certainly have to introduce some categorizations for the lists (we'll have at least "checking" and "to translate"). This is still an open idea though, nothing guaranteed.


Step 8 - Integrating a vote system

We will start working on a vote system only when we have a larger number of active contributors. Right now, the number of active contributors in a given language probably doesn't exceed 5. There have to be at least 20, in my opinion, for the vote system to be worth implementing.

This is probably NOT something we will work on before another 3 or 4 months.


Step 9 - Locking sentences

Eventually, when a sentence has been checked and/or discussed again, again and again, it would make sense to lock it so that no one can edit it anymore, not even the owner.

The fact that a sentence is locked will also be the guarantee that a sentence is completely reliable. But again, this won't be integrated before at least several months.


Phase 2 - Step 1 - Checking translations

Well, I'll write another post about this when we get there, because this is a trickier issue...


Let's do the math

Suppose it takes an average of 5 seconds to check a sentence and correct it if needed (which is quite optimistic). We have about 370,000 sentences. That's 500+ hours of checking. So with enough people, it's not that much.

(Optimistically)
Japanese would need about 210 hours.
English, 200 hours.
French, 40 hours.
German, 16 hours.
Polish, 13 hours.
Well, we have other languages, but I won't list them all here...
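For those who want to redo the arithmetic, here it is as a short Python sketch. The 5 seconds per sentence and the 370,000 sentences come from the estimate above; the team of 10 checkers is a purely hypothetical figure, and the sentence count for French is only what the 40-hour estimate implies.

```python
SECONDS_PER_SENTENCE = 5  # optimistic average from the estimate above


def hours_to_check(sentence_count: int) -> float:
    """Hours needed to check a given number of sentences."""
    return sentence_count * SECONDS_PER_SENTENCE / 3600


def sentences_in(hours: int) -> int:
    """Number of sentences that can be checked in a given number of hours."""
    return hours * 3600 // SECONDS_PER_SENTENCE


total_hours = hours_to_check(370_000)
print(round(total_hours))           # 514 hours in total
print(round(total_hours / 10, 1))   # 51.4 hours each for 10 hypothetical checkers
print(sentences_in(40))             # 28800 sentences implied by the French estimate
```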

French is the first language into which we will pour our efforts (the project is based in France, after all). We can surely get the French sentences checked by the end of May.

Saturday, January 24, 2009

New validation system

Context

There are currently more than 330,000 sentences in Tatoeba (all languages included). Most come from a Japanese-English corpus called the Tanaka Corpus. Part of this corpus was translated into French about a year and a half ago, thanks to the initiative of the webmaster of Tokidoki, who later gave me these translations to integrate into Tatoeba.

We now have about 150,000 sentences in English, roughly the same number in Japanese, and almost 24,000 in French.

The problem is that many of these sentences still contain mistakes. To understand why, you need to understand how these sentences were collected.


Tanaka Corpus

For those who have not read the page about the Tanaka Corpus, or who do not speak English well enough, here is the explanation (and a quick translation):
Professor Tanaka's students were each given the task of collecting 300 sentence pairs. After several years, 212,000 pairs had been collected.

[...]

The original collection contained numerous errors, in both the Japanese and the English. Many of these errors were spelling and transcription mistakes, although in a significant number of cases the Japanese and English sentences contained grammatical, syntactic, etc. errors, or the translations did not match at all.
A huge amount of work has been done to maintain this corpus, and it was done mainly by one man (Paul Blay). He could not be expected to eliminate every mistake.


French translations

The French translations I received were the result of the work of 80 volunteers. The idea of this translation project was to first translate as many sentences as possible, even if the result was not always correct, and only later go through a verification phase. The project stopped after a short time, however, and the sentences that had already been translated never got the chance to be verified.


Old validation system

In the old version of Tatoeba, new contributions were not added directly to the rest of the collection. Instead, they were added to a waiting list. Moderators could access this list, validate the correct contributions, and reject the incorrect ones. The aim was to keep the number of incorrect sentences or translations from growing.

But without a solid group of dedicated and qualified moderators, this kind of system was clearly very slow and very cumbersome.


New validation system

In the new validation system, there are no more moderators. Instead, each sentence will belong to an owner, and only the owner can modify the sentence. Contributors will be responsible for the sentences they own. If you see a mistake in a sentence that is not yours, you can post a comment about it. Of course, each user will be able to quickly access the comments written about the sentences they own.

If a user does not feel able to take on the responsibility, he or she can give up ownership of a sentence. These "orphan" sentences can then be adopted by other users. Right now, I can tell you that most sentences are orphans, and the goal is to find them a parent.

On top of that, everyone will be able to follow what other contributors are doing in Tatoeba. If some people are not doing a good job and are blocking many sentences that contain mistakes by adopting them without correcting them, it will not be difficult to take away their rights.

Thursday, January 22, 2009

New validation system

Context

There are currently over 330,000 sentences in Tatoeba (all languages included). Most of them come from an English-Japanese corpus named Tanaka Corpus. Part of this corpus was translated into French about a year and a half ago thanks to the initiative of Tokidoki's webmaster, who later gave me the translations to integrate into Tatoeba.

We now have about 150,000 sentences in English, about the same number in Japanese, and almost 24,000 in French.

The problem is, many of these sentences still have mistakes. And to understand why, you have to understand how those sentences were collected. 


Tanaka Corpus

For those who didn't want to read the page about the Tanaka Corpus, here's the explanation:
Professor Tanaka's students were given the task of collecting 300 sentence pairs each. After several years, 212,000 sentence pairs had been collected.
[...]
The original collection contained large numbers of errors, both in the Japanese and English. Many of the errors were in spelling and transcription, although in a significant number of cases the Japanese and English contained grammatical, syntactic, etc. errors, or the translations did not match at all.
A huge amount of work has been done to maintain this corpus, but it was done mostly by one man (Paul Blay), and you couldn't expect him to get rid of all the mistakes.


French translations

The French translations that were given to me were the result of the work of 80 volunteers. The idea of this translation project was first of all to translate as much as possible, even if the result was not always correct, and only later to go through a phase of verification. The project stopped early though, and the sentences that had already been translated never went through verification.


Old validation system

In the old version of Tatoeba, new contributions were not added directly to the rest of the sentence collection. Instead, they were added to a waiting list. Moderators could see this list, validate the sentences that were correct and reject those that were not. The aim was to prevent additional misspelled sentences or even wrong translations.

But unless I had a bunch of devoted and very qualified moderators (which I didn't), this kind of system was clearly very slow and heavy.


New validation system

In the new validation system, there are no moderators anymore. Instead, each sentence will have an owner, and only the owner can modify the sentence. Contributors will be responsible for the sentences they own. If you see a mistake in a sentence that is not yours, you can post a comment about it. Of course, each user will be able to quickly access the comments that were posted about their sentences.

If a user doesn't feel (s)he can take the responsibility, (s)he will be able to give up ownership of a sentence. These "orphan" sentences can be adopted by other users. Right now I can tell you that most of the sentences are orphans and the goal is to find them a parent.

On top of that, it will be possible for every user to follow other users' contributions in Tatoeba. If some people are not doing a good job and are blocking many sentences that have mistakes by adopting them and not correcting them, it won't be difficult to withdraw their ownership.
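To make the ownership rules concrete, here is a minimal Python sketch of the model described above. It is an illustration only, not Tatoeba's actual code: the `Sentence` class and its method names are invented for the example, and an orphan sentence is represented by an owner of `None`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sentence:
    text: str
    owner: Optional[str] = None  # None means the sentence is an orphan

    def edit(self, user: str, new_text: str) -> None:
        """Only the owner can modify the sentence."""
        if self.owner != user:
            raise PermissionError("only the owner can modify this sentence")
        self.text = new_text

    def release(self, user: str) -> None:
        """The owner gives up ownership; the sentence becomes an orphan."""
        if self.owner != user:
            raise PermissionError("only the owner can release this sentence")
        self.owner = None

    def adopt(self, user: str) -> None:
        """Any user can adopt an orphan sentence."""
        if self.owner is not None:
            raise ValueError("this sentence already has an owner")
        self.owner = user


sentence = Sentence("I recieve letters.")   # starts out as an orphan
sentence.adopt("alice")
sentence.edit("alice", "I receive letters.")
print(sentence.owner, "-", sentence.text)
# prints: alice - I receive letters.
```

Other users who spot a mistake would post a comment instead of calling `edit`, which is why quick access to comments matters in this model.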