Friday, April 30, 2010

Reliability of the sentences, how will we handle this?

Reliability has always been a big issue in Tatoeba. Many sentences have mistakes, but there is currently no indication of whether a sentence is correct, or whether it is an accurate translation. So when you look at a sentence, you can never be 100% sure that you can rely on it.

We will start introducing some measures to solve this.

The first objective is to have all the sentences in Tatoeba adopted by someone. The languages most concerned are Japanese, English and French. They are the main languages in Tatoeba and I have explained in a very old post where they come from. The post also explains the idea behind "adoption" of sentences. You can also read this discussion where I explained my point of view on the "adopt" feature in more detail.

However, adopting is not enough. Sooner or later we will need some sort of "vote system". But before integrating a new feature, we can use what we already have. This article describes how we are likely to proceed, but since this is still experimental, it is of course not bound to remain as described. The procedure can be improved as we experiment with it, and all feedback is welcome.

Step 1 - Generating lists

We will generate lists, which could be named following the template:
[checking] $languageCode ($languageName), $whateverYouWant

For instance:
[checking] fra (French), list 1
These lists will be private, and each of them will be attributed to the person who has to check the sentences in that list. They will be filled (in priority) with "orphan sentences", which won't remain orphan very long because we will have them automatically assigned to the person in charge of the checking.
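As a minimal sketch, the naming template above could be filled in like this. The function name and parameters are only illustrative; nothing here is actual Tatoeba code.

```python
def checking_list_name(language_code: str, language_name: str, label: str) -> str:
    """Fill in the template '[checking] $languageCode ($languageName), $whateverYouWant'.

    The function and its parameter names are hypothetical, chosen for this example.
    """
    return f"[checking] {language_code} ({language_name}), {label}"


# Reproduces the example given in the text.
print(checking_list_name("fra", "French", "list 1"))
```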

Step 2 - Checking

Users will have to check sentences in their native language only. Most importantly, they will NOT be checking the accuracy of translations, only the sentences themselves, i.e. if there is any spelling or grammar mistake. We will deal with accuracy of translations much later.

While correcting sentences, the meaning of the sentence must never be altered.

There may be some cases where you are not sure whether the sentence should be edited or not. For instance, sentences that are archaic or sentences that are grammatically correct, but do not sound like what a native speaker would say. There is no absolute rule... Usually, if the sentence is archaic, you can leave it alone. If it is grammatically correct but not "natural", then try to make it sound natural. At any rate, if you are not sure what to do, post a comment on the sentence and we will see what should be done about it.

Step 3 - Marking sentences as checked

If you are volunteering to check sentences, then you will first have to check and correct all the sentences you own (and that are in your native language). This is because we will consider all these sentences as "checked", except those that are in your "checking" list, which are of course still being checked.

All the sentences that are not adopted or that belong to someone who is not part of the "checking team" will be considered as not checked.
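To make the rule concrete, here is a small sketch of how the "checked" status could be derived from what we already have. The data model (a `Sentence` class, a set of team members, a mapping of native languages, and per-user checking lists) is entirely an assumption made for this example.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set


@dataclass
class Sentence:
    sentence_id: int
    lang: str
    owner: Optional[str]  # None means the sentence is an orphan (not adopted)


def is_checked(
    sentence: Sentence,
    checking_team: Set[str],              # hypothetical: usernames of volunteers
    native_lang: Dict[str, str],          # hypothetical: username -> native language code
    checking_lists: Dict[str, Set[int]],  # hypothetical: username -> ids still being checked
) -> bool:
    """Apply the rule described above; all names here are illustrative."""
    owner = sentence.owner
    if owner is None or owner not in checking_team:
        return False  # not adopted, or owned by someone outside the checking team
    if native_lang.get(owner) != sentence.lang:
        return False  # only sentences in the owner's native language count
    # Sentences still sitting in the owner's "checking" list are not done yet.
    return sentence.sentence_id not in checking_lists.get(owner, set())


team = {"alice"}
natives = {"alice": "fra"}
lists = {"alice": {2}}
print(is_checked(Sentence(1, "fra", "alice"), team, natives, lists))
print(is_checked(Sentence(2, "fra", "alice"), team, natives, lists))
```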

Once you are done checking sentences in your list, we will by default renew the list.
Or you can rename your checking list into "[checked] $lang ...", in case you want to keep track of the various batches of sentences you have checked. We will then generate a new list for you.

Anyway, as you can see, with the features we have, we can already start reviewing sentences "en masse". The problem, however, is that the list feature is not exactly optimized for this...

Step 4 - Exporting lists into CSV file

This will be a useful feature in general, but in our particular case, it will enable people to check sentences offline, as well as execute a "replace all" if they come across a recurrent mistake.

On that matter, don't hesitate to send us an email and tell us about recurrent mistakes you find and that can be corrected systematically. This will help us create a script to correct these mistakes in the non-adopted sentences so that contributors won't have to waste time working on things that can be processed automatically.
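For illustration, a "replace all" on an exported file could look like the sketch below. The export format is not defined yet, so the column names (`id`, `lang`, `text`) and the sample rows are assumptions made for this example only.

```python
import csv
import io
from typing import List, Tuple

# Hypothetical export: the real CSV layout has not been decided.
SAMPLE_EXPORT = (
    "id,lang,text\n"
    "1,eng,I recieve mail every day.\n"
    "2,eng,She sings well.\n"
)


def replace_all(csv_text: str, wrong: str, right: str) -> Tuple[str, List[str]]:
    """Replace a recurrent mistake in every sentence; return the new CSV and the changed ids."""
    reader = csv.DictReader(io.StringIO(csv_text))
    output = io.StringIO()
    writer = csv.DictWriter(output, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    changed_ids = []
    for row in reader:
        fixed = row["text"].replace(wrong, right)
        if fixed != row["text"]:
            changed_ids.append(row["id"])
            row["text"] = fixed
        writer.writerow(row)
    return output.getvalue(), changed_ids


corrected_csv, changed = replace_all(SAMPLE_EXPORT, "recieve", "receive")
print(corrected_csv)
print(changed)
```

Keeping track of the changed ids makes it easier to review what a blind replacement actually touched, since a "replace all" can also hit sentences where the text was correct.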

Step 5 - Re-importing CSV file

People who decide to correct from the CSV file can later import back their corrections. All the corrections made in the file will be applied to the sentences in Tatoeba. At this time, it is still not very clear how the re-import process will be handled.
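Since the re-import is not designed yet, the following is only one possible approach, sketched under assumptions: compare the edited file with the original export and keep just the sentences whose text changed. The column names (`id`, `text`) and function names are hypothetical.

```python
import csv
import io
from typing import Dict


def read_sentences(csv_text: str) -> Dict[str, str]:
    """Map sentence id -> text, assuming hypothetical 'id' and 'text' columns."""
    return {row["id"]: row["text"] for row in csv.DictReader(io.StringIO(csv_text))}


def collect_corrections(exported_csv: str, edited_csv: str) -> Dict[str, str]:
    """Return only the sentences whose text differs from the original export."""
    original = read_sentences(exported_csv)
    corrections = {}
    for sentence_id, text in read_sentences(edited_csv).items():
        # Ignore ids that were never exported: the file should only carry corrections.
        if sentence_id in original and text != original[sentence_id]:
            corrections[sentence_id] = text
    return corrections


exported = "id,text\n1,Il fais beau.\n2,Bonjour.\n"
edited = "id,text\n1,Il fait beau.\n2,Bonjour.\n3,Nouvelle phrase.\n"
print(collect_corrections(exported, edited))
```

Applying only the differences would avoid rewriting sentences that nobody touched, but it would not by itself handle a sentence edited on the site between export and import.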

Step 6 - Re-adapting the list page

Some people may prefer checking from a page in Tatoeba rather than from a file, so that they can easily post a comment when needed, unadopt the sentence and leave it to someone else to deal with if it's just too difficult for them, favorite it, or add it to another list...

But the list page is not exactly optimized for that, so we will try to provide a page where users can easily do these things while checking.

Step 7 - Reorganizing the lists

There may be a point where the lists section starts getting a bit too messy. We will certainly have to introduce some categorizations for the lists (we'll have at least "checking" and "to translate"). This is still an open idea though, nothing guaranteed.

Step 8 - Integrating a vote system

We will start working on a vote system only when we have a larger number of active contributors. Right now, the number of active contributors in a given language probably doesn't exceed 5. There have to be at least 20, in my opinion, for the vote system to be worth implementing.

This is probably NOT something we will work on for another 3 or 4 months.

Step 9 - Locking sentences

Eventually, when a sentence has been checked and/or discussed again, again and again, it would make sense to lock it so that no one can edit it anymore, not even the owner.

The fact that a sentence is locked will also be the guarantee that a sentence is completely reliable. But again, this won't be integrated for at least several months.

Phase 2 - Step 1 - Checking translations

Well, I'll write another post about this when we get there, because this is a trickier issue...

Let's do the math

Suppose it takes an average of 5 seconds to check a sentence and correct it if needed (which is quite optimistic). We have about 370,000 sentences. That's 370,000 × 5 seconds, or 500+ hours of checking. So with enough people, it's not that much.

Japanese would need about 210 hours.
English, 200 hours.
French, 40 hours.
German, 16 hours.
Polish, 13 hours.
Well, we have other languages, but I won't list them all here...
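The arithmetic can be reproduced with a few lines. The 5 seconds, the 370,000 sentences and the per-language hours come from this post; the sentence counts per language are only what those rounded hours imply, and the 20 volunteers are a made-up figure for the example.

```python
SECONDS_PER_SENTENCE = 5    # optimistic average, from the post
TOTAL_SENTENCES = 370_000   # approximate corpus size, from the post

# Hours quoted above for each language.
HOURS_FROM_POST = {"Japanese": 210, "English": 200, "French": 40, "German": 16, "Polish": 13}


def hours_to_check(sentence_count: int) -> float:
    """Total checking time in hours for a given number of sentences."""
    return sentence_count * SECONDS_PER_SENTENCE / 3600


def implied_sentences(hours: float) -> int:
    """Sentence count implied by a rounded number of hours (not an exact figure)."""
    return round(hours * 3600 / SECONDS_PER_SENTENCE)


def hours_per_person(sentence_count: int, volunteers: int) -> float:
    """Workload per volunteer if the checking is split evenly."""
    return hours_to_check(sentence_count) / volunteers


print(round(hours_to_check(TOTAL_SENTENCES)))
for language, hours in HOURS_FROM_POST.items():
    print(language, implied_sentences(hours))
# 20 volunteers is a hypothetical number, used only to show the split.
print(round(hours_per_person(TOTAL_SENTENCES, 20), 1))
```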

French will be the first language we focus our efforts on (the project is based in France, after all). We can surely get the French sentences checked by the end of May.

Sunday, April 18, 2010

Switching from Lucene to Sphinx

As if migrating to a new server wasn't enough, we also decided to migrate to a new search engine. It was a rather on-the-fly decision, but I must admit, it was fun :D

A little bit of context

Until now we were using a search engine called Lucene. It's written in Java, and the integration of Lucene into Tatoeba is something that was coded three years ago, back when I didn't know how to code and wasn't even sure yet I would pursue a career in computer science.
I was just very lucky that one student in computer science at my university found out about my project and was interested in joining me in the task of integrating a search engine, as part of a school project (thank you François, if you are reading this).

The problem is, running Lucene takes a lot of memory. And our new server doesn't have a lot of memory (512MB RAM). So we figured, okay, we'll just leave the search engine on the old server (2GB RAM), Masa (the admin) will not mind.

But Masa wanted to clean up his server and reinstall it from scratch, but couldn't. He didn't want Tatoeba to be in trouble (because that meant we had to find somewhere else to go, even if it would be temporary). So when I told him we were moving to our own server, he was quite excited: he could finally reinstall in peace. I told him our migration was scheduled for Saturday April 17th, and that we would find a temporary solution for the search engine, so he could do whatever he wanted on Sunday.

Migration day

Saturday, migration day. Lots of things to do. And I couldn't be in Paris with 3 other members of my team (Allan, Robin and Baptiste), so it only made the task harder. I won't go into details, but we reached the end of the day, everything went pretty well, except we hadn't taken care of the search engine yet...

We were on IRC, and Robin and Baptiste had left. I was telling Allan all the hackish stuff we would need to do to set up the search engine, because the initial plan was to temporarily use his machine at work to host it. But then he said, "Okay, this is too hackish, I'll try to find another solution, otherwise we will never update the search engine".

Except, I had received an email from Masa earlier, telling me he would really like it if we could be done migrating by 1AM, so I told Allan "But Masa really really wants to reinstall his server, we need to have something working by midnight". And it was 8PM...

How we decided to use Sphinx

Allan was not going to give up so easily. He started telling me that he had already done some searches before, and that Sphinx was often mentioned as a competitor of Lucene.
me: Sphinx or Lucene, if you can code me something within 2-3 hours, I have nothing against it.
So he kept going, telling me that Sphinx handles stemming, that it's written in C++, that someone made a behavior to integrate it in CakePHP...
me: Alright, but it will be for next week :P
Allan: So I didn't really have a choice...
me: Ah because you want to do this now?
Allan, quoting me: Sphinx or Lucene, if you can code me something within 2-3 hours, I have nothing against it.
me: Well okay, we can try it.
Allan: Yea because you know, there wasn't any big fail in our migration, so we need to add more pressure, otherwise it's not fun.
me, thinking: Like I didn't have enough pressure for the day *sigh*. (Allan was in the train while *I* was doing the migration)
me: Give me the links you have, I'll see what I can do to speed up the integration.
It was 8:30PM.

How things went

Things went very well :) Note that none of us knew much about Sphinx before. We had no idea how difficult (or how easy) it was to install it, and run it, and integrate it in CakePHP. Allan took care of the installation & configuration part while I was taking care of the integration in CakePHP.

I still had to know how to install it locally though. As a Windows user, I must say this link helped me a lot:

Once I understood how Sphinx worked and how to get it to work (which took me a bit more than one hour), all I had to do was follow the explanations in the Sphinx Behavior documentation, adapt the code to Tatoeba, figure out how to pass GET variables with CakePHP's Paginator, and add a "warning" message to let users know that we're switching to a new search engine and that some features are no longer available (but of course we will integrate them back as soon as possible).

In the meantime, Allan installed Sphinx on our new server, figured out how to create one index for each language so that people can still search in a specific language, figured out how to query that index from CakePHP, and figured out how to make the search work for languages that have non-ASCII characters.
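To give an idea of what "one index for each language" can mean, here is a sketch that generates one `source` block and one `index` block per language for a Sphinx configuration file. This is not our actual configuration: the database name, credentials, table and column names, and paths are all hypothetical, and `charset_type` is a directive of the older Sphinx releases of that period.

```python
from typing import Iterable


def sphinx_blocks(lang: str) -> str:
    """Build a source/index pair for one language (all values are placeholders)."""
    lines = [
        f"source src_{lang}",
        "{",
        "    type = mysql",
        "    sql_host = localhost",
        "    sql_user = example_user",   # hypothetical credentials
        "    sql_pass = example_pass",
        "    sql_db = example_db",       # hypothetical database name
        "    sql_query_pre = SET NAMES utf8",
        # Hypothetical table and columns; the first column is the document id.
        f"    sql_query = SELECT id, text FROM sentences WHERE lang = '{lang}'",
        "}",
        "",
        f"index {lang}_index",
        "{",
        f"    source = src_{lang}",
        f"    path = /var/lib/sphinx/{lang}",  # hypothetical index location
        "    charset_type = utf-8",
        "}",
    ]
    return "\n".join(lines)


def sphinx_config(langs: Iterable[str]) -> str:
    """Concatenate the blocks for every language code."""
    return "\n\n".join(sphinx_blocks(lang) for lang in langs) + "\n"


print(sphinx_config(["fra", "eng"]))
```

Generating the blocks from a list of language codes keeps the file consistent when a new language is added; languages written without spaces would likely need extra settings that this sketch leaves out.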

It was then 1AM, and we had done it: installed Sphinx, integrated it into CakePHP, got it working for all the languages we support, run the tests to make sure basic searches worked, and updated Tatoeba.

Now everything is soooo fast, it's awesome. Besides, indexing with Sphinx only takes 30-60 seconds (compared to 15-20 minutes with our 3-year-old Lucene code). So we can afford to index much more often.

The whole experience was awesome as well. The challenge, the teamwork, the achievement. I loved it :D

Friday, April 16, 2010

Tatoeba update (Apr 17th, 2010)

Well, I was supposed to be in Paris with my team at the moment, but some volcano decided it would be otherwise - flight canceled. So be it, Tatoeba will still be updated today.

New server

That's the most important news: we're moving to a new server, kindly provided to us by the Free Software Foundation in France.
It may surprise some of you, but we weren't on our own server until... well, now. We were hosted by Masa (not his real name), webmaster of I have to thank him for hosting Tatoeba - for free - for the last two years. I also have to thank him for giving Tatoeba the (18,000) French translations of the Tanaka Corpus that he gathered (with many volunteers), also two and a half years ago.

Cleaned up sentences

Duplicate sentences will be merged, and the { } annotations that you can find in some sentences will be removed.

Private messages

The private messages look better now :) The private message system needs to be changed someday though, to be more practical. Something similar to the Wall, except it would be private. However, people do not use private messages that much, so it is not urgent.

What next?

I won't be talking about our update in two weeks (because we haven't really decided yet), but rather about the next two months.
  • As usual, we will be debugging and optimizing our code.
  • We will take some time to reach out to other people. We are starting to have quite a long list of people to contact, and it's time we actually contacted them.
  • We will work on improving the profile and the lists.
  • Tatoeba is currently built on a PHP framework, CakePHP, but we will start switching to Django (something we've been considering for a few months already). It's not like we're going to entirely recode Tatoeba. We still have to discuss how we'll be doing this.
  • And we will move our source code to GitHub (also something we've been considering for a couple of months).

Saturday, April 3, 2010

Tatoeba update (Apr 3rd, 2010)

So this is a pretty important update, with quite a lot of new stuff.


Well I have been saying this many times already, but we now have audio! Well, at least a little...

Japanese & MeCab

We have switched to another piece of software for Japanese romanization, MeCab. Actually, we are not displaying romaji anymore.

Pagination on Wall

We have finished paginating the Wall so you won't have to wait forever to get to read the new messages.

Link/unlink sentences

We have implemented the link/unlink feature. The owner of a sentence can turn indirect translations into direct ones by linking it to his/her sentence. He or she can also unlink a translation, if the translation does not mean the same thing. This feature will not be available to everyone however... Only to a few chosen ones.
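Here is a small sketch of the idea, assuming that an indirect translation is simply a translation of a translation. The class and its methods are hypothetical and do not reflect how Tatoeba stores links.

```python
from collections import defaultdict
from typing import Dict, Set


class TranslationGraph:
    """Toy model: sentences are ids, and a link means 'direct translation of'."""

    def __init__(self) -> None:
        self._links: Dict[int, Set[int]] = defaultdict(set)

    def link(self, a: int, b: int) -> None:
        """Make two sentences direct translations of each other."""
        self._links[a].add(b)
        self._links[b].add(a)

    def unlink(self, a: int, b: int) -> None:
        """Remove the direct link, e.g. when the meanings differ."""
        self._links[a].discard(b)
        self._links[b].discard(a)

    def direct(self, sentence_id: int) -> Set[int]:
        return set(self._links[sentence_id])

    def indirect(self, sentence_id: int) -> Set[int]:
        """Translations of translations that are not already direct."""
        direct = self.direct(sentence_id)
        reachable: Set[int] = set()
        for neighbour in direct:
            reachable |= self._links[neighbour]
        return reachable - direct - {sentence_id}


graph = TranslationGraph()
graph.link(1, 2)           # sentence 2 translates sentence 1
graph.link(2, 3)           # sentence 3 translates sentence 2
print(graph.indirect(1))   # sentence 3 is only an indirect translation of 1
graph.link(1, 3)           # the owner of sentence 1 links it directly
print(graph.direct(1), graph.indirect(1))
```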

Trusted users

There is now a new user status that we call "trusted user". For now, the only good thing about being a trusted user is that you can link and unlink sentences, while normal users can't. But in the future, we will start by testing new features with trusted users, who can then give us feedback so that we can improve the features. Only then will we release them for everyone.

Note that there are no specific criteria to become a trusted user. But one very important condition is to have read ENTIRELY the "How to be a good contributor" guide.

What next?

As usual, many many things. But the main thing is that we are going to move to a new server. We recently asked the Free Software Foundation (France) if they could host our project, which is a free project (AGPL license for source code and CC-BY for corpus files). They accepted, so we now have our own server. The migration is scheduled for April 17th. Once that is done, Tatoeba will be (much?) faster - because it is pretty slow right now.

Friday, April 2, 2010

Japanese romanization in Tatoeba, now using MeCab

We used to display romaji in Tatoeba... We don't anymore. Well, at least not directly. We are now going to display the reading in hiragana. You can however get the romaji version by hovering your mouse over the hiragana and waiting for the little tooltip to appear.



There has been some discussion about it (like here, here or here), and I think this solution will make everyone happy.

Now, of course, the output generated is not perfect. So if anyone out there is interested in improving the generated hiragana, then please let us know! As much as I agree that the reading is vital information for Japanese learners, I will NOT have time to make it any better. I'd really, really like it if someone could take on this task.

For your information, we were using KAKASI in order to convert Japanese text into romaji. We have now switched to MeCab. Our romaji/furigana converter is still based on KAKASI though.

Thursday, April 1, 2010

Audio for Tatoeba sentences, in partnership with Shtooka

We started to add audio in Tatoeba, and it will be available on April 3rd. Great, isn't it? :D

Yes, but (there is a but) you will probably be disappointed to see that most of the sentences will be indicating "audio unavailable". So far, only a few hundred sentences have audio, which is barely 0.1% of the whole corpus. This is not inevitable, however! If you are interested in helping us add more audio, keep reading.

First of all, about Shtooka

Shtooka is a small non-profit organization based in Paris whose goal is to gather collections of audio for words, expressions, proverbs, sentences, etc. You can browse their collections here.

We met them at an event they organized on February 13th, and thanks to them, we are now starting to integrate audio into Tatoeba.

Audio for Shanghainese

The audio we have so far is in Shanghainese. Yes, we do have such an exotic language. Now, you may be wondering why on Earth we picked Shanghainese. Well, for a few reasons.
  • Allan (aka. sysko), one of the most active developers in the team, is very interested in Chinese, and more particularly in Shanghainese. He was provided 900 Shanghainese sentences from
  • Congcong (aka. fucongcong), one of the most important contributors in Tatoeba, speaks Shanghainese.
  • They were both able to meet regularly with Nicolas (aka. zmoo), president of Shtooka, in order to record these sentences in Paris.

Want more?

Needless to say, we will be very happy to add audio for any other language. But it's not going to be easy, and it's not going to be possible without your help! So if you are interested...
  1. First of all, send us an email at, with the title "Audio for Tatoeba in [insert-language-here]".
  2. You have to know that Shtooka insists a lot on quality, therefore recording from your laptop's microphone is not an option. We will explain things in more detail when we get back to you.
  3. Then if you are still motivated, start gathering sentences for which you would like to record audio, by creating lists. Limit each list to 100 sentences max.
  4. Note that you can also create lists just to gather sentences for which you want audio, even if you are not going to record them. Just make sure that all the sentences in a list are in the same language.
Anyway, having audio in Tatoeba is really exciting for us, and we hope that many of you will join us in this quest!