Friday, April 30, 2010

Reliability of the sentences: how will we handle this?

Reliability has always been a big issue in Tatoeba. Many sentences contain mistakes, but there is currently no indication of whether a sentence is correct, or whether it is an accurate translation. So when you look at a sentence, you can never be 100% sure whether you can rely on it.

We will start introducing some measures to solve this.

The first objective is to have all the sentences in Tatoeba adopted by someone. The languages most concerned are Japanese, English and French. They are the main languages in Tatoeba, and I have explained in a very old post where they come from. The post also explains the idea behind "adoption" of sentences. You can also read this discussion, where I explained my point of view on the "adopt" feature in more detail.

However, adopting is not enough. Sooner or later we will need some sort of "vote system". But before integrating a new feature, we can use what we already have. This article describes how we are likely to proceed, but since this is still experimental, the procedure may well change. It can be improved as we experiment with it, and all feedback is welcome.


Step 1 - Generating lists

We will generate lists, which could be named following this template:
[checking] $languageCode ($languageName), $whateverYouWant

For instance:
[checking] fra (French), list 1

These lists will be private, and each of them will be assigned to the person who has to check the sentences in that list. They will be filled (in priority) with "orphan sentences", which won't remain orphans for very long because we will have them automatically assigned to the person in charge of the checking.
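
To make the naming template concrete, here is a minimal sketch in Python. The function name and its parameters are made up for illustration; this is not how Tatoeba actually generates the lists.

```python
def checking_list_name(language_code, language_name, label):
    """Build a list name following the template
    "[checking] $languageCode ($languageName), $whateverYouWant".

    The function and parameter names are hypothetical.
    """
    return f"[checking] {language_code} ({language_name}), {label}"


print(checking_list_name("fra", "French", "list 1"))
```

With the values from the example above, this gives back exactly "[checking] fra (French), list 1".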


Step 2 - Checking

Users will have to check sentences in their native language only. Most importantly, they will NOT be checking the accuracy of translations, only the sentences themselves, i.e. whether there are any spelling or grammar mistakes. We will deal with the accuracy of translations much later.

While correcting sentences, the meaning of the sentence must never be altered.

There may be some cases where you are not sure whether the sentence should be edited or not. For instance, sentences that are archaic or sentences that are grammatically correct, but do not sound like what a native speaker would say. There is no absolute rule... Usually, if the sentence is archaic, you can leave it alone. If it is grammatically correct but not "natural", then try to make it sound natural. At any rate, if you are not sure what to do, post a comment on the sentence and we will see what should be done about it.


Step 3 - Marking sentences as checked

If you are volunteering to check sentences, you will first have to check and correct all the sentences you own that are in your native language. This is because we will consider all of these sentences as "checked", except those that are in your "checking" list, which are of course still being checked.

All the sentences that are not adopted or that belong to someone who is not part of the "checking team" will be considered as not checked.
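
The rule above can be written down as a small function. This is only a sketch of the rule as described in this post: the Sentence class, the field names and the way the "checking" lists are represented (a set of sentence ids) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class Sentence:
    sentence_id: int
    owner: Optional[str]  # None means the sentence is an orphan (hypothetical field)


def is_checked(sentence: Sentence,
               checking_team: Set[str],
               in_checking_lists: Set[int]) -> bool:
    """Return True if the sentence counts as "checked" under the rule above.

    checking_team: usernames of the people who volunteered to check.
    in_checking_lists: ids of sentences currently in a "[checking]" list.
    """
    if sentence.owner is None:
        return False  # not adopted
    if sentence.owner not in checking_team:
        return False  # owner is not part of the checking team
    # Owned by a checker: checked unless it is still waiting in a checking list.
    return sentence.sentence_id not in in_checking_lists


team = {"alice"}      # hypothetical usernames
pending = {3}         # sentence 3 is still in a "[checking]" list
print(is_checked(Sentence(1, "alice"), team, pending))
print(is_checked(Sentence(2, None), team, pending))
print(is_checked(Sentence(3, "alice"), team, pending))
```

Only the first sentence counts as checked here: the second one is an orphan and the third one is still being checked.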

Once you are done checking the sentences in your list, we will by default renew the list.
Alternatively, you can rename your checking list to "[checked] $lang ...", in case you want to keep track of the various batches of sentences you have checked. We will then generate a new list for you.

Anyway, as you can see, with the features we have, we can already start reviewing sentences "en masse". The problem, however, is that the list feature is not exactly optimized for this...


Step 4 - Exporting lists into CSV file

This will be a useful feature in general, but in our particular case, it will enable people to check sentences offline, as well as execute a "replace all" if they come across a recurrent mistake.

On that matter, don't hesitate to send us an email about any recurrent mistakes you find that can be corrected systematically. This will help us create a script to correct these mistakes in the non-adopted sentences, so that contributors won't have to waste time on things that can be processed automatically.
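
To give an idea of what a "replace all" on an exported list could look like, here is a minimal sketch using Python's standard csv module. The export format is not decided yet, so the two-column layout (sentence id, then text) is purely an assumption, and the function name is made up for the example.

```python
import csv
import io


def replace_all(csv_text, wrong, right):
    """Apply one systematic correction to every sentence in a CSV export.

    Assumes each row is "id,text" (a hypothetical layout).
    Returns the corrected CSV text and the ids of the sentences that changed.
    """
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    changed_ids = []
    for sentence_id, text in reader:
        fixed = text.replace(wrong, right)
        if fixed != text:
            changed_ids.append(sentence_id)
        writer.writerow([sentence_id, fixed])
    return out.getvalue(), changed_ids


exported = "1,I recieve mail.\n2,Nothing wrong here.\n"
corrected, changed = replace_all(exported, "recieve", "receive")
print(corrected)
print("Changed sentences:", changed)
```

Keeping track of the ids that changed would make it easier to review the corrections before applying them anywhere. Note that a blind "replace all" can also alter sentences where the pattern was not a mistake, so the changed sentences are worth a second look.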


Step 5 - Re-importing CSV file

People who decide to correct from the CSV file can later import their corrections back. All the corrections made in the file will be applied to the sentences in Tatoeba. At this time, it is still not very clear how the re-import process will be handled.


Step 6 - Re-adapting the list page

Some people may prefer checking from a page in Tatoeba rather than from a file, so that they can easily post a comment when needed, unadopt the sentence and leave it to someone else to deal with if it's just too difficult for them, favorite it, or add it to another list...

But the list page is not exactly optimized for that, so we will try to provide a page where users can easily do these things while checking.


Step 7 - Reorganizing the lists

There may be a point where the lists section starts getting a bit too messy. We will certainly have to introduce some categorizations for the lists (we'll have at least "checking" and "to translate"). This is still an open idea though, nothing guaranteed.


Step 8 - Integrating a vote system

We will start working on a vote system only when we have a larger number of active contributors. Right now, the number of active contributors in a given language probably doesn't exceed 5. There have to be at least 20, in my opinion, for the vote system to be worth implementing.

This is probably NOT something we will work on before another 3 or 4 months.


Step 9 - Locking sentences

Eventually, when a sentence has been checked and/or discussed again and again, it would make sense to lock it so that no one can edit it anymore, not even the owner.

The fact that a sentence is locked will also be the guarantee that a sentence is completely reliable. But again, this won't be integrated before at least several months.


Phase 2 - Step 1 - Checking translations

Well, I'll write another post about this when we get there, because this is a trickier issue...


Let's do the math

Suppose it takes an average of 5 seconds to check a sentence and correct it if needed (which is quite optimistic). We have about 370,000 sentences. That's 500+ hours of checking (370,000 × 5 seconds is roughly 514 hours). So with enough people, it's not that much.

(Optimistically)
Japanese would need about 210 hours.
English, 200 hours.
French, 40 hours.
German, 16 hours.
Polish, 13 hours.
Well, we have other languages, but I won't list them all here...
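
For those who want to redo the estimate with their own numbers, here is a small sketch of the calculation. The 5 seconds per sentence and the 370,000 sentences come from the paragraph above; the number of checkers (20) is only a hypothetical value for the example.

```python
def checking_hours(sentence_count, seconds_per_sentence=5.0):
    """Hours needed to check sentence_count sentences at a fixed pace."""
    return sentence_count * seconds_per_sentence / 3600


total = checking_hours(370_000)
print(f"{total:.0f} hours in total")                    # 514 hours in total
print(f"{total / 20:.1f} hours each for 20 checkers")   # 25.7 hours each for 20 checkers
```

If the real pace turns out to be 10 seconds per sentence instead of 5, all the figures simply double.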

French will be the first language we focus our efforts on (the project is based in France, after all). We can surely get the French sentences checked by the end of May.
