Wednesday, August 25, 2010

Tatoeba update (August 25th, 2010)

Small update.

What's new
  • Autocompletion for tags. NOTE: tags are still available only to trusted users so this feature will only affect them.
  • Tags are organized by popularity. The number of sentences tagged is also indicated.
  • Changed the top menu a little. There's now a sub-menu for the "Browse" section, to easily browse by language, by list and by tag.
  • Changed the position of the search input, for better usability.

What next

I can tell you we're preparing a shiny new version of Tatoeba. When it will be ready is still unknown, but it definitely won't be tomorrow. It will take several months.
If you're interested in beta testing it, feel free to drop a mail at, with the title "Tatoeba beta testing". We'll get back to you when the time comes :)

Saturday, August 7, 2010

Tatoeba update (August 7th, 2010)

This post covers changes that were applied on July 26th, in addition to those made on August 7th.

What's new?
  • We are now displaying furigana for Japanese sentences. It previously looked like this: 私[わたし] に あいさつ する よう な 人[ひと] は い ない ...which was not very practical to read.
  • We have added a filter in the comments section. You can now display only comments that are posted on sentences in a certain language (for instance, only comments on Esperanto sentences)
  • We add new languages regularly, but this week we're adding a rather special one: CycL. This was requested by our member witbrock. I'm very curious to see where this is going to lead...
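
To illustrate the furigana item above: the old display put each reading in square brackets right after its word. Here is a minimal sketch, assuming that bracket format, of how such text could be turned into HTML ruby markup so the reading is shown above the word. The function name `to_ruby` and the regular expression are hypothetical and are not Tatoeba's actual code.

```python
import re

# Assumption: a reading is written in square brackets directly after its word,
# e.g. 私[わたし]. Words without brackets are left untouched.
BRACKET_READING = re.compile(r"([^\s\[\]]+)\[([^\]]+)\]")


def to_ruby(text: str) -> str:
    """Wrap every word[reading] pair in <ruby>/<rt> tags."""
    return BRACKET_READING.sub(r"<ruby>\1<rt>\2</rt></ruby>", text)


if __name__ == "__main__":
    # Sample taken from the old display format quoted above.
    print(to_ruby("私[わたし] に あいさつ する よう な 人[ひと] は い ない"))
```

A browser that supports ruby annotations would then render わたし in small type above 私, instead of inline in brackets.
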
What next?

API. More and more people have been asking us whether we provide an API. We currently don't, but we definitely want to provide one someday. I can't say when yet, and I don't want to make promises, but I'll post progress updates as they happen.

Copyright. More copyright issues have been raised lately. So I'll be writing a post about it, to try to explain clearly the issues we are facing related to copyright and what you can do to help.

Tuesday, August 3, 2010

Submission policy - What kind of content do we want?

This article explains what kind of content we accept in Tatoeba, what kind of content we delete and what kind of content we review. Note that this article is not final. You have the right to object to something or to ask for clarification.

What do we accept?

Tatoeba is about collecting sentences so we only want sentences. However, what exactly do we mean by "sentences"? What is a sentence and what is not? It's actually a difficult question... No one will doubt that "I am happy" is a sentence. But what about "On the left", is that a sentence? What about "Thank you", "Yes", or "Awesome"?

As far as I'm concerned, I think Tatoeba can handle a loose definition of "sentence". We don't strictly need an entity with at least a verb. To me, when spoken, everything is a sentence. When written, the main difference between a sentence and a non-sentence is punctuation. That's all. For the rest, as long as people can imagine a context in which the "sentence" could be said, it's a sentence.
So yes, I'm roughly saying that you could take any word in the dictionary, add punctuation and perhaps a capital letter, and turn it into a sentence. I don't encourage it because it's not useful (dictionaries do that already), but one-word sentences are still tolerated. I'll trust people's common sense to add only one-word sentences that are significant (for instance, "Hello" is, "House" isn't).

If you run across sentences that are not, strictly speaking, sentences, tag them as "non-sentence" so that there is a way to quickly identify them. Point the owner to this article if he's a new member, and let him know it's better to have sentences with more context.
At any rate, don't bother starting endless discussions if the sentence has already been translated because it will be kept as is. Feel free however to add a new sentence based on the "non-sentence".

Generally speaking, Tatoeba is open to many kinds of sentences. We tolerate casual speech, slang, insults (as long as they are not targeting anyone in particular), erotic sentences, and sentences that are not "true" (after all, Tatoeba is not an encyclopedia). These sentences can be tagged accordingly to inform users. But I'll ask people to focus primarily on appropriate and politically correct sentences. We don't yet have a good system to filter out sentences that are not very "safe", so don't flood us with those, please.

What do we delete?

What we delete for sure are:
  • Entries that people add by mistake due to our failure to provide a more efficient interface.
  • Sentences whose owners have themselves asked for deletion (because the delete feature is still not available to everyone).
  • Entries that are copyrighted or under a license that is not compatible with CC-BY.
  • Racist comments and personal attacks, if they are really harmful and there is general agreement that they should be removed.
  • Entries that really make no sense and whose owner won't provide any explanation.
With a view to providing better content, I'm also allowing the deletion of "sentences" that are "not really sentences" and came from the Tanaka Corpus, but only under these conditions:
  • The vocabulary is already illustrated in other sentences.
  • There is only the Japanese-English pair, no translation into any other language. We can make an exception for French (i.e. it's still deletable if there is a French translation).
  • All the sentences that will be deleted do NOT belong to anyone.
It may be obvious, but you should avoid translating a sentence that is likely to be deleted... Unless you want to stand against its deletion.

What do we review?

By "reviewing" I mean correcting mistakes. So we correct spelling mistakes, grammar mistakes, bad formulations, etc. We want Tatoeba's data to be used (or at least usable) for educational purposes, so we want good quality sentences.

However, the line between a "correct" and an "incorrect" sentence is not always clear, and some sentences can generate a lot of debate. In such cases, the final decision belongs to the owner of the sentence.

Remember that Tatoeba allows several translations in the same language, so there is no point fighting endlessly over what is correct or not. Simply add another version of the sentence if you are not happy with the existing one. We don't mind at all having near-duplicate sentences (cf. this discussion on the Wall, and more precisely my thoughts on the issue here).

We also don't want any kind of annotations in the sentences. You can find more details in the contributor's guide, rule #9. If you have a good reason to keep your annotations, then please explain it in your comments. Otherwise moderators have the right to edit your sentence two weeks after you have been asked to change it.

What do we link?

Tatoeba's sentences are represented as a graph. Two sentences that are linked together have the same meaning. Linking two sentences in the same language is accepted, but you shouldn't link based on meaning alone. The sentences that you link should also have an equivalent "style" and type of speech. Cf. my wall post here.

NOTE: Only trusted users can link sentences.
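
To make the graph idea more concrete, here is a minimal sketch, assuming that links are undirected and that each sentence has a numeric id and a language code. The class `SentenceGraph`, its method names and the sample sentences are hypothetical, chosen only for illustration. They do not describe Tatoeba's actual data model.

```python
from collections import defaultdict, deque


class SentenceGraph:
    """Toy model: sentences are nodes, links are undirected edges."""

    def __init__(self):
        self.sentences = {}            # id -> (language code, text)
        self.links = defaultdict(set)  # id -> ids of directly linked sentences

    def add_sentence(self, sid, lang, text):
        self.sentences[sid] = (lang, text)

    def link(self, a, b):
        # A link means "same meaning", so it is stored in both directions.
        self.links[a].add(b)
        self.links[b].add(a)

    def direct_translations(self, sid):
        return sorted(self.links.get(sid, ()))

    def within(self, sid, max_depth):
        """Breadth-first search: map each reachable id to its link distance."""
        seen = {sid: 0}
        queue = deque([sid])
        while queue:
            cur = queue.popleft()
            if seen[cur] == max_depth:
                continue
            for nxt in self.links.get(cur, ()):
                if nxt not in seen:
                    seen[nxt] = seen[cur] + 1
                    queue.append(nxt)
        del seen[sid]
        return seen


if __name__ == "__main__":
    g = SentenceGraph()
    g.add_sentence(1, "eng", "I am happy.")
    g.add_sentence(2, "fra", "Je suis heureux.")
    g.add_sentence(3, "jpn", "私は幸せです。")
    g.link(1, 2)
    g.link(2, 3)
    print(g.direct_translations(1))
    print(g.within(1, 2))
```

In this toy example the Japanese sentence is not linked to the English one directly, but it is reachable through the French sentence. This is one reason the style of linked sentences matters: any mismatch is carried along the chain of links.
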

Monday, August 2, 2010

For language lovers and learners: an introduction to the Tatoeba project

Who among us has not tried to learn another language? And who among us has not come across words they did not know how to use, and then said to themselves: "Ah, if only this dictionary had examples with their Arabic translation." The Tatoeba project tries to fill this gap. The idea of the project is a website that collects sentences and translates them into many languages. Once you have registered, you can add new sentences, translate the sentences already on the site, and correct them. The site is built on participation: many of the existing sentences were added by members who speak those languages as their mother tongue.
The site currently has nearly half a million words in fifty-three languages, in addition to about four thousand spoken sentences through the Shtooka project. All the sentences and audio files on the site are licensed under CC-BY 2.0/FR, and the site's code itself is licensed under AGPL v3.

To visit the site
To view the site's code and download it in full

All of the site's databases can be downloaded from here

You can also take part in translating the site into Arabic here

Thank you to saeb for writing this little description of Tatoeba in Arabic.