Friday, May 31, 2013

Tatoeba update (May 31st, 2013)

After a long time without many updates, we're finally starting to have some changes in the code. Hopefully we'll have even more updates in the next few months, as people have been contacting us to let us know they would like to help with the development and maintenance of Tatoeba.

So what's new?

Inappropriate comments
The main change in this update is that admins can now hide comments that have been considered inappropriate. Such comments will only be displayed to the author and to the admins. Other people will only see a message informing them that the comment was hidden because it didn't comply with our rules. This may not be our definitive way of handling inappropriate comments, but we're at least going to give it a try and see how it works.

Download files
For those who use the files we provide on our downloads page, there was a change in the way list data is exported. There is more information now, and the export was split into two files rather than just one. Cf. "Lists" and "Sentences in lists" on the downloads page.

Light display
We have made a light version of the sentence page, in which only the sentence and its direct translations are displayed. Here's what sentence #1 looks like in the light version. This is useful for those who would like to include our content on their website, as is done here, for example.

Friday, May 17, 2013

The story of Tatoeba

Someone sent us an email to ask more about the story behind Tatoeba. It's true that there isn't much information on that matter, so I figured I would take some time to write about it.


It all started when I was traveling to Germany at the end of January 2006. At that time I was really fond of learning languages, and I was especially in love with Japanese. But there I was, visiting a good friend of mine in Germany and unable to speak German, so I was wondering if there were any good German-French or German-English dictionaries.

With Japanese, I had found what I considered back then to be the most awesome dictionary of all time, http://www.alc.co.jp/. I loved it so much simply because it wasn't limited to words. I could search for things like "hello" or "table", but I could also search for expressions or partial sentences like "out of the blue" or "sometimes I think that". And it would return results. Some of the results were regular dictionary results, but the others were actually sentences containing the searched word(s), along with the translations of these sentences. That helped me a lot.

After searching for good German-French/German-English dictionaries and not finding anything satisfying, I started to search for such a dictionary for other pairs of languages. That led me to ask myself what my ideal dictionary would be, and for some reason, I just couldn't stop thinking about it. So over the following days I ended up writing everything I had in mind in a short document that I'm publishing here for the first time: Trang's ideal dictionary (just want to mention I was 19 years old at the time).

When I went back to France, I sent this document to several of my penpals. I tried to find people who would be interested in working on it with me, to either code it for me or to teach me what I would need to know to code it myself (mostly to code it for me, because I didn't believe I would be able to do it myself). The best help I found was someone suggesting that I take a look at PHP. I had no idea what PHP was, but I googled it, went to the PHP website, and downloaded something. Then in the files and folders of PHP, I searched and clicked on any .exe file I could find, hoping that it would open a program in which I could type something, then click a button and make it display whatever I asked it to display... But nothing I expected happened, so I gave up.

A couple of months later my little sister was trying to make a website, her online diary or something. She was following a tutorial about HTML, PHP and MySQL written for complete beginners. When I saw there was something about PHP, I took a look at the tutorial as well, and then things started to make a lot more sense. I spent a whole week experimenting, trying to make a small website with two pages: one page where I could save sentences and translations, and another where I could search these sentences. I found out it wasn't that difficult, that I didn't need to find someone with years of experience in programming, that I could actually do it myself.

The very first version of Tatoeba wasn't called Tatoeba. I wouldn't even call it a first version; it was more of a prototype. It was hosted on SourceForge under the codename multilangdict. I called it a "dictionary" but I knew it wasn't really going to be a dictionary. It was already clear to me that the focus of the project would be to collect sentences and their translations, since that was the type of data I couldn't find easily.

As soon as I had a somewhat functional website, I asked everyone I knew who might be interested to come and add or translate sentences. To my surprise, some people told me they found translating addictive. I think the addiction came from the fact that a lot of the sentences that were added weren't the "textbook" kind of sentences. The database was empty, so we had to fill it with whatever we could. Most of the sentences were just a part of the life of whoever added them. It could be something they said, something they heard, something they thought of. That gave them a touch of authenticity, I guess, and made them more interesting.

A year later, during summer 2007, a new version... or rather the first version of the project was coded (the previous version was more of an experiment). It was around that time that I decided to call the project "Tatoeba". I chose this name because the goal of the project was to give example sentences, and "tatoeba" means "for example" in Japanese.

After the first version was coded, I imported the sentences from the Tanaka Corpus into the project, a collection of around 150,000 pairs of Japanese-English sentences. The database then grew from ~5000 sentences to ~300,000 sentences. The community was still non-existent, but at least there was more data that could potentially attract more people. Although one of my friends confessed to me that she liked the project much more before there were all these boring sentences from this corpus.

I released the second version of Tatoeba (which is the one in use at the moment) in December 2008. Sysko joined me in the project during summer 2009 and helped me a lot. I was very happy about this because up until that point I was pretty much "alone" on the project. I had people helping me from time to time, but no one who would really get involved as much as Sysko did. And he has really done sooooo much for the project.

Anyway, it's been a long road and a lot of things have happened, but today we're at 2.3 million sentences and growing, with thousands of people using it every day. The main problem now is quality (which is a topic that can bring, and has brought, very heated debates). But that's something I will talk about in another post.
This post is just for those who were wondering about how Tatoeba started :)