Tatoeba Project

Wednesday, August 25, 2010

Tatoeba update (August 25th, 2010)

Small update.

What's new
  • Autocompletion for tags. NOTE: tags are still available only to trusted users so this feature will only affect them.
  • Tags are organized by popularity. The number of sentences tagged is also indicated.
  • Slightly changed the top menu. There's now a sub-menu for the "Browse" section, to easily browse by language, by list and by tag.
  • Changed the position of the search input, for better usability.

What next

I can tell you we're preparing a shiny new version of Tatoeba. When it will be ready is still unknown, but it's definitely not for tomorrow. It will take several months.
If you're interested in beta testing it, feel free to drop us an email at team@tatoeba.fr, with the title "Tatoeba beta testing". We'll get back to you when the time comes :)

Saturday, August 7, 2010

Tatoeba update (August 7th, 2010)

This post talks about changes that were applied on July 26th, in addition to those done on August 7th.

What's new?
  • We are now displaying furigana for Japanese sentences. It previously looked like this: 私[わたし] に あいさつ する よう な 人[ひと] は い ない ...which was not very practical to read.
  • We have added a filter in the comments section. You can now display only comments that are posted on sentences in a certain language (for instance, only comments on Esperanto sentences).
  • We add new languages regularly, but this week, we're adding quite a special language: CycL. This was requested by our member witbrock. I'm very curious to see where this is going to lead...
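
To make the furigana item above more concrete, here is a minimal sketch of how the old bracketed notation could be split into text and readings. It is not Tatoeba's actual code: the function names are hypothetical, and it assumes tokens are separated by spaces and that a reading always follows its word directly in square brackets, as in the quoted example.

```python
import re

# Matches a token such as "私[わたし]": base text followed by a reading in brackets.
TOKEN_WITH_READING = re.compile(r"^(?P<base>[^\[\]]+)\[(?P<reading>[^\[\]]+)\]$")


def parse_bracket_furigana(text):
    """Split a space-separated sentence into (base, reading) pairs.

    Tokens without a bracketed reading get None as their reading.
    """
    pairs = []
    for token in text.split():
        match = TOKEN_WITH_READING.match(token)
        if match:
            pairs.append((match.group("base"), match.group("reading")))
        else:
            pairs.append((token, None))
    return pairs


def strip_readings(text):
    """Rebuild the plain sentence (no spaces, no readings) from the notation."""
    return "".join(base for base, _ in parse_bracket_furigana(text))


if __name__ == "__main__":
    old_style = "私[わたし] に あいさつ する よう な 人[ひと] は い ない"
    for base, reading in parse_bracket_furigana(old_style):
        print(base, reading)
    print(strip_readings(old_style))
```

With the pairs separated like this, a display layer can put each reading above its word instead of inline.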
What next?

API. More and more people have been asking us whether we provide an API. We currently don't, but we definitely want to provide one someday. I can't say when yet, and I don't want to make promises, but I'll post progress updates as they happen.

Copyright. More copyright issues have been raised lately. So I'll be writing a post about it, to try to explain clearly the issues we are facing related to copyright and what you can do to help.

Tuesday, August 3, 2010

Submission policy - What kind of content do we want?

This article explains what kind of content we accept in Tatoeba, what kind of content we delete and what kind of content we review. Note that this article is not final. You have the right to object to something or to ask for clarification.


What do we accept?

Tatoeba is about collecting sentences so we only want sentences. However, what exactly do we mean by "sentences"? What is a sentence and what is not? It's actually a difficult question... No one will doubt that "I am happy" is a sentence. But what about "On the left", is that a sentence? What about "Thank you", "Yes", or "Awesome"?

As far as I'm concerned, I think Tatoeba can handle a loose definition of "sentence". We don't strictly need to have an entity with at least a verb. To me, when spoken, everything is a sentence. When written, the main difference between a sentence and a non-sentence is typography. That's all. For the rest, as long as people can imagine a context where the "sentence" can be expressed, then it's a sentence.
So yes, I'm roughly saying that you can take any word in the dictionary, add punctuation and perhaps a capital letter, and you'd turn it into a sentence. I don't encourage it because it's not useful (dictionaries do that already), but one-word sentences are still tolerated. I'll trust people's common sense to add only one-word sentences that are significant (for instance, "Hello" is, "House" isn't).

If you run across sentences that are not, strictly speaking, sentences, then tag them as "non-sentence", so that there is a way to quickly identify them. Point the owner to this article if they are a new member, and let them know it's better to have sentences with more context.
At any rate, don't bother starting endless discussions if the sentence has already been translated because it will be kept as is. Feel free however to add a new sentence based on the "non-sentence".

Generally speaking, Tatoeba is open to many kinds of sentences. We tolerate casual speech, slang, insults (as long as they are not targeting anyone in particular), erotic sentences, sentences that are not "true" (after all, Tatoeba is not an encyclopedia). These sentences can be tagged accordingly to inform users. But I'll ask people to focus primarily on appropriate and politically correct sentences. We don't have (yet) a good system to filter out sentences that are not very "safe", so don't flood us with those, please.


What do we delete?

What we delete for sure are:
  • Entries that people add by mistake due to our failure to provide a more efficient interface.
  • Sentences whose owners have asked for them to be deleted (because the delete feature is still not available to everyone).
  • Entries that are copyrighted or under a license that is not compatible with CC-BY.
  • Racist comments and personal attacks, if they are really harmful and there is a general agreement that they should be removed.
  • Entries that really make no sense and whose owner won't provide any explanation.
With the aim of providing better content, I'm also allowing the deletion of "sentences" that are "not really sentences" and came from the Tanaka Corpus, but only under these conditions:
  • The vocabulary is already illustrated in other sentences.
  • There is only the Japanese-English pair, no translation into any other language. We can make an exception for French (i.e. it's still deletable if there is a French translation).
  • All the sentences that will be deleted do NOT belong to anyone.
It may be obvious, but you should avoid translating a sentence that is likely to be deleted... Unless you want to stand against its deletion.


What do we review?

By "reviewing" I mean correcting mistakes. So we correct spelling mistakes, grammar mistakes, bad formulations, etc. We want Tatoeba's data to be used for educational purposes, so we want good quality sentences.

However, the line between a "correct" and an "incorrect" sentence is not always clear, and some sentences can generate a lot of debate. In such cases, the final decision belongs to the owner of the sentence. Remember that Tatoeba allows several translations in the same language, so there is no point fighting endlessly over what is correct or not. Simply add another version of the sentence if you are not happy with the existing one; we don't mind at all having near-duplicate sentences (cf. this discussion on the Wall, and more precisely my thoughts on the issue here).

We also don't want any kind of annotations in the sentences.
If you need to indicate that a sentence is a proverb, female speech or a quote, then post a comment about it, don't add this information directly in the sentence.
Things like "I saw her/him the other day" should be split into two sentences ("I saw her the other day" and "I saw him the other day").
We need sentences to be as raw as possible because they can be used in other projects involving language processing. It's also harder for people to translate sentences that contain alternatives (like "him/her"). Finally, if we want to record audio for the sentence, we'll need to choose what exactly to record, and annotations don't help.

Monday, August 2, 2010

For language lovers and learners: an introduction to the Tatoeba project

Who among us has not tried to learn another language? And who among us has not come across words they did not know how to use, and then said to themselves: "Ah, if only this dictionary had examples with their translation into Arabic"? The Tatoeba project tries to fill this gap. The idea of the project is a website that collects sentences and their translations into many languages. Once you register, you can add new sentences, translate the sentences already on the site, and correct them. The site is built on participation: many of the existing sentences were added by members who speak those languages as their mother tongue.
The site currently has close to half a million words in fifty-three languages, in addition to about four thousand sentences with audio through the Shtooka project. All the sentences and audio files on the site are licensed under CC-BY 2.0/FR, and the site's code itself is licensed under AGPL v3.

To visit the site
To view the site's code and download it in full


All of the site's databases can be downloaded from here

You can also take part in translating the site into Arabic here


Thank you to saeb for writing this little description of Tatoeba in Arabic.

Saturday, July 17, 2010

Tatoeba update (Jul 17th, 2010)

First of all, I'd like to mention that we've had a lot of traffic lately. Allan published an article on linuxfr.org about Tatoeba, and it sure brought a lot of new people :D
Google Analytics says 1,172 unique visitors on July 17th, while we usually have around 400-450. We're glad to see the server is still doing well despite the fairly significant increase in activity!


What's new

We can now import sentences. Since July 4th actually, but I didn't have much time to write about it. The feature is currently only available to moderators, because we cannot safely let everyone import huge amounts of data. So the way it works is that you send us your sentences in a simple text file, by email (team@tatoeba.fr), and we import them.

We accept two formats:
  1. Single sentences: each line has one sentence. All the sentences have to be in the same language.
  2. Sentences + translations: each line has a sentence and its translation, separated by a tab (sentence [tab] translation). All the sentences have to be in the same language, and all the translations in the same language. For instance only French-Spanish, and not French-Spanish on one line and Swedish-Spanish on the next.
IMPORTANT: We release our data under the Creative Commons Attribution (CC-BY) license. We will not import your content if it raises copyright issues or license incompatibilities. For instance, don't send us sentences taken from textbooks, or sentences that are under the CC-BY-SA license (it's not compatible with CC-BY).
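
If you want to check your file before sending it, the two formats above can be sketched as a small parser. This is only an illustration under my own assumptions, not the import script we run: the function name is hypothetical, blank lines are assumed to be ignored, and a file that mixes both formats is assumed to be rejected. It cannot check that the sentences really are in one language.

```python
def parse_import_file(lines):
    """Parse import lines into (sentence, translation) tuples.

    The translation is None for the single-sentence format.
    Raises ValueError on malformed lines or when both formats are mixed.
    """
    entries = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # assumption: blank lines are simply skipped
        parts = line.split("\t")
        if len(parts) == 1:
            # Format 1: one sentence per line.
            entries.append((parts[0].strip(), None))
        elif len(parts) == 2 and parts[0].strip() and parts[1].strip():
            # Format 2: sentence [tab] translation.
            entries.append((parts[0].strip(), parts[1].strip()))
        else:
            raise ValueError(
                f"line {number}: expected 'sentence' or 'sentence<TAB>translation'"
            )

    has_pairs = any(translation is not None for _, translation in entries)
    has_singles = any(translation is None for _, translation in entries)
    if has_pairs and has_singles:
        raise ValueError("file mixes single sentences and sentence/translation pairs")
    return entries


if __name__ == "__main__":
    sample = ["Bonjour.\tHola.\n", "Merci.\tGracias.\n"]
    for sentence, translation in parse_import_file(sample):
        print(sentence, "->", translation)
```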

So far we imported:
  • ~700 pairs of sentences in Chinese-Shanghainese. In total we have ~900 pairs of sentences thanks to shanghaining.com. The first 200 were added by hand.
  • 200+ proverbs in Dutch.
  • 250+ proverbs in Ukrainian.
That's the major thing for the last couple of weeks.


What next?

We still have to import 2500+ pairs of English-Spanish sentences, provided by one of our registered users, Łukasz. And probably thousands and thousands of other sentences, as more and more people discover Tatoeba, and have their own private (or not so private) collections of sentences to share with everyone :)

In terms of features, there will not be much going on in the next couple of weeks. Actually it will depend on the rest of the team, but as far as I'm concerned, I will have other priorities.

There are still a lot of things that can be improved about the current features, and we will keep improving them, but in August we will also start discussing the next new stuff. I will write more about it when we get there.

Right now I'd just like to say thank you to everyone who gave this project a little bit - or a lot - of their time, of their knowledge, of their encouragement... Because Tatoeba has become an awesome place for language lovers and learners, and for that, the credit really goes to the community :)

Sunday, June 27, 2010

Tatoeba update (Jun 27th, 2010)

What's new
  • Page that lists all the tags. NOTE: It's not organized at all, it's really just for the sake of having a page that displays all the existing tags.
  • Page that lists all the sentences in a specific language, with the possibility to show only those that are NOT translated yet into a certain language. For instance Japanese sentences not yet translated into English. Useful feature for contributors =)
  • Possibility to filter by language, on the page that lists sentences with a certain tag.
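
For the curious, the "not translated yet" listing above boils down to a simple filter. The sketch below is an illustration only, not how the site is implemented: the `Sentence` class, its fields and the language codes ("jpn", "eng", "fra") are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Sentence:
    """Hypothetical in-memory record of a sentence and its translations."""
    id: int
    lang: str
    text: str
    translation_langs: set = field(default_factory=set)


def untranslated(sentences, source_lang, target_lang):
    """Return sentences in source_lang that have no translation in target_lang."""
    return [
        s for s in sentences
        if s.lang == source_lang and target_lang not in s.translation_langs
    ]


if __name__ == "__main__":
    corpus = [
        Sentence(1, "jpn", "ありがとう。", {"eng", "fra"}),
        Sentence(2, "jpn", "おはよう。", {"fra"}),
        Sentence(3, "eng", "Thank you.", {"jpn"}),
    ]
    for s in untranslated(corpus, "jpn", "eng"):
        print(s.id, s.text)
```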
What's next
  • Possibility to import sentences from a CSV file. This feature won't be available to normal users. For a start (and I think for a long time), only moderators will have access to it. So anyone who wants to import sentences from a file will have to make a request. Anyway, the main point is that as soon as we have this feature, we will add a massive number of new sentences =]

Friday, June 11, 2010

Tatoeba update (June 12th, 2010)

What's new

I am glad to announce that we are finally introducing... tags!! :D

This will provide a way for people to add meta-data to sentences. For instance "proverb", "formal", "informal", "male", "female", etc. Such information can be very useful for language learners because they cannot necessarily guess such things just by reading the sentence.

Tags will be restricted for a short period of time. Only trusted users will be able to add tags, but everyone can see the tags associated with a sentence. When we feel the feature is ready for everyone, we will allow everyone to add tags.

People will be free to tag sentences with whatever they want. We don't really have any strict rules yet because tags are still new, and we want to see how people use them. But I can at least suggest some basic tags:
  • proverb, archaic, slang
  • formal, informal
  • male, female (to indicate whether the sentence is said by a man or a woman)
  • to delete, to correct, checked (I will talk more about these)
  • controversial, unsafe (to mark sentences that can cause problems, are not suitable for kids, etc).
  • easy, intermediate, difficult (to indicate the level of difficulty of a sentence)
So these are only my suggestions. Again, the tag feature is new, so we will necessarily go through a phase of experimentation before we can clearly set any rule. We count on everyone to try and help us figure out what works best. Feel free to discuss issues related to tags on the Wall.

A few more things you need to know about tags:
  • You can see the list of sentences associated with a certain tag by clicking on the tag.
  • You can remove a tag from a sentence only if you were the one who added it.
  • Moderators can remove any tag.
  • It's not possible to add the same tag twice to a sentence. If someone has already added "proverb", you can't re-add "proverb".
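
The add and remove rules above can be summed up in a few lines of code. This is a minimal sketch under my own assumptions, not the code running on the site: the class and method names are hypothetical, and users are identified by a plain username string.

```python
class TaggedSentence:
    """Hypothetical model of a sentence and the tags attached to it."""

    def __init__(self, text):
        self.text = text
        self._tags = {}  # tag name -> username of whoever added it

    def add_tag(self, tag, user):
        """Add a tag; return False if the sentence already has it."""
        if tag in self._tags:
            return False
        self._tags[tag] = user
        return True

    def remove_tag(self, tag, user, is_moderator=False):
        """Remove a tag if the user added it or is a moderator."""
        added_by = self._tags.get(tag)
        if added_by is None:
            return False  # the sentence does not have this tag
        if is_moderator or added_by == user:
            del self._tags[tag]
            return True
        return False

    @property
    def tags(self):
        """Tag names in alphabetical order."""
        return sorted(self._tags)


if __name__ == "__main__":
    sentence = TaggedSentence("Better late than never.")
    print(sentence.add_tag("proverb", "alice"))     # first time: accepted
    print(sentence.add_tag("proverb", "bob"))       # duplicate: refused
    print(sentence.remove_tag("proverb", "bob"))    # not the adder: refused
    print(sentence.remove_tag("proverb", "bob", is_moderator=True))
```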

"to delete" tag

These tags will help moderators in their work. At the moment, in Tatoeba, only moderators can delete sentences. The traditional way of requesting a deletion was to add a comment to the sentence, pointing out that it should be deleted (and explaining why). But the flow of comments has increased a lot and it's harder for moderators to keep track.

So if you come upon a sentence that you feel should be deleted, then tag it with "to delete" so that moderators can easily find it and clean Tatoeba of entries that are not valid. Anything that is gibberish is not valid. Anything that is not a complete sentence is not valid. But then again, we haven't decided what exactly is a "sentence" so it's debatable.


"to correct" tag

In Tatoeba, it is not possible to modify a sentence that doesn't "belong" to you. The sentences that belong to you are typically the ones you have added yourself. No one (or almost no one) can touch them besides you. If someone sees a mistake in your sentence, all they can do is post a comment, and you have to correct it.

But certain members contribute sentences with mistakes and never come back. And for now, no one can correct their mistakes... except moderators. So if you want to help moderators: whenever you come across a sentence that needs to be corrected, that has a comment asking for the correction, and that has still not been corrected after two weeks, you can tag it with "to correct".
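
As a rough illustration of the rule above, here is a tiny check for when the "to correct" tag applies. It is only a sketch: the function name and parameters are made up for the example, and "two weeks" is taken to mean at least 14 days since the comment asking for the correction.

```python
from datetime import date, timedelta

# Assumption: "two weeks" means 14 days or more since the correction request.
WAITING_PERIOD = timedelta(weeks=2)


def can_tag_to_correct(correction_requested_on, today, already_corrected):
    """Return True when a sentence qualifies for the "to correct" tag.

    correction_requested_on is the date of the comment asking for a
    correction, or None if no such comment exists.
    """
    if correction_requested_on is None or already_corrected:
        return False
    return today - correction_requested_on >= WAITING_PERIOD


if __name__ == "__main__":
    requested = date(2010, 6, 1)
    print(can_tag_to_correct(requested, date(2010, 6, 20), already_corrected=False))
```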


"checked" tag

Before I explain further, I must stress that this tag is experimental. Many times people have asked for a way to tell whether a sentence can be trusted or not. Okay, so now we can tag a sentence as "checked" to indicate that it has been proofread and validated as a correct sentence.

Of course, this raises some problems...
  • What if a user tags a sentence as "checked" just for the fun of it?
  • What if a user tags a sentence as "checked" but was tired and overlooked a mistake?
Well, we can't guarantee 100% accuracy. A sentence that is tagged "checked" will simply have a higher reliability rate than one that isn't, but it won't be 100% (no one can guarantee that anyway).


What's next
  • We will make tags available to everyone.
  • We will add a page that lists all tags, to enable people to easily browse by tags.
  • We will provide a way to merge tags.
  • And many other things, but I will talk about them when the time comes.
In the meantime, enjoy :)