Tuesday, August 3, 2010

Submission policy - What kind of content do we want?

This article explains what kind of content we accept in Tatoeba, what kind we delete, and what kind we review. Note that this article is not final: you are welcome to object to anything in it or to ask for clarification.

What do we accept?

Tatoeba is about collecting sentences so we only want sentences. However, what exactly do we mean by "sentences"? What is a sentence and what is not? It's actually a difficult question... No one will doubt that "I am happy" is a sentence. But what about "On the left", is that a sentence? What about "Thank you", "Yes", or "Awesome"?

As far as I'm concerned, Tatoeba can handle a loose definition of "sentence". We don't strictly need an entity with at least a verb. To me, when spoken, everything is a sentence. When written, the main difference between a sentence and a non-sentence is punctuation. That's all. For the rest, as long as people can imagine a context in which the "sentence" could be said, it's a sentence.
So yes, I'm roughly saying that you could take any word in the dictionary, add punctuation and perhaps a capital letter, and turn it into a sentence. I don't encourage it because it's not useful (dictionaries do that already), but one-word sentences are still tolerated. I'll trust people's common sense to add only one-word sentences that are significant (for instance, "Hello" is, "House" isn't).

If you run across sentences that are not, strictly speaking, sentences, tag them as "non-sentence" so that there is a way to quickly identify them. If the owner is a new member, point them to this article and let them know it's better to have sentences with more context.
At any rate, don't bother starting endless discussions if the sentence has already been translated, because it will be kept as is. Feel free, however, to add a new sentence based on the "non-sentence".

Generally speaking, Tatoeba is open to many kinds of sentences. We tolerate casual speech, slang, insults (as long as they are not targeting anyone in particular), erotic sentences, and sentences that are not "true" (after all, Tatoeba is not an encyclopedia). These sentences can be tagged accordingly to inform users. But I'll ask people to focus primarily on appropriate and politically correct sentences. We don't yet have a good system for filtering out sentences that are not very "safe", so please don't flood us with those.

What do we delete?

What we delete for sure are:
  • Entries that people add by mistake due to our failure to provide a more efficient interface.
  • Sentences that their owners have asked us to delete (because the delete feature is still not available to everyone).
  • Entries that are copyrighted or under a license that is not compatible with CC-BY.
  • Racist comments and personal attacks, if they are really harmful and there is general agreement that they should be removed.
  • Entries that really make no sense and whose owner won't provide any explanation.
With a view to providing better content, I'm also allowing the deletion of "sentences" that are "not really sentences" and came from the Tanaka Corpus, but only under all of these conditions:
  • The vocabulary is already illustrated in other sentences.
  • There is only the Japanese-English pair, with no translation into any other language. We can make an exception for French (i.e. the pair is still deletable if there is a French translation).
  • All the sentences that will be deleted do NOT belong to anyone.
It may be obvious, but you should avoid translating a sentence that is likely to be deleted, unless you want to stand against its deletion.
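For those who like to see rules as code, the Tanaka Corpus conditions above could be sketched roughly like this. This is only an illustration, not how Tatoeba actually implements anything: the `Sentence` class, its fields, the language codes and the `vocabulary_covered_elsewhere` flag (which a human would have to judge) are all assumptions made up for the example.

```python
from dataclasses import dataclass, field
from typing import Optional

# Assumption: languages are identified by ISO 639-3 style codes.
# A pair may have Japanese, English and (by exception) French, nothing else.
ALLOWED_LANGS = {"jpn", "eng", "fra"}


@dataclass
class Sentence:
    """Hypothetical record for one sentence; not Tatoeba's real schema."""
    text: str
    lang: str
    owner: Optional[str] = None          # None means it belongs to no one
    from_tanaka: bool = False            # imported from the Tanaka Corpus
    tags: set = field(default_factory=set)
    translation_langs: set = field(default_factory=set)


def may_delete_tanaka_non_sentence(sentence, vocabulary_covered_elsewhere):
    """Return True only if every condition from the list above holds."""
    return (
        sentence.from_tanaka
        and "non-sentence" in sentence.tags
        and vocabulary_covered_elsewhere
        and sentence.owner is None
        # Translations may only be in Japanese, English or French.
        and sentence.translation_langs <= ALLOWED_LANGS
    )
```

The check would have to pass for every sentence in the pair, since none of the sentences to be deleted may belong to anyone.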

What do we review?

By "reviewing" I mean correcting mistakes. So we correct spelling mistakes, grammar mistakes, bad formulations, etc. We want Tatoeba's data to be used (or at least usable) for educational purposes, so we want good quality sentences.

However, the line between a "correct" and an "incorrect" sentence is not always clear, and some sentences can generate a lot of debate. In such cases, the final decision belongs to the owner of the sentence.

Remember that Tatoeba allows several translations in the same language, so there is no point fighting endlessly over what is correct or not. Simply add another version of the sentence if you are not happy with the existing one. We don't mind at all having near duplicate sentences (cf. this discussion on the Wall, and more precisely my thoughts on the issue here).

We also don't want any kind of annotations in the sentences. You can find more details in the contributor's guide, rule #9. If you have a good reason to keep your annotations, then please explain it in your comments. Otherwise, moderators have the right to edit your sentence two weeks after you have been asked to change it.

What do we link?

Tatoeba's sentences are represented as a graph. Two sentences that are linked together have the same meaning. Linking two sentences in the same language is accepted, but you shouldn't link based on meaning alone. The sentences that you link should also have an equivalent "style" and type of speech. Cf. my wall post here.

NOTE: Only trusted users can link sentences.
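To make the graph idea a bit more concrete, here is a minimal sketch of sentences as nodes and links as edges. It is a toy model and not Tatoeba's actual code: the `SentenceGraph` class, the integer sentence ids, the `user_is_trusted` flag and the choice to treat links as symmetric are all assumptions made for the example.

```python
from collections import defaultdict


class SentenceGraph:
    """Toy graph: nodes are hypothetical sentence ids, edges are links."""

    def __init__(self):
        # Maps a sentence id to the set of ids it is linked to.
        self._links = defaultdict(set)

    def link(self, a, b, user_is_trusted):
        """Link two sentences that have the same meaning and style."""
        if not user_is_trusted:
            raise PermissionError("only trusted users can link sentences")
        if a == b:
            raise ValueError("a sentence cannot be linked to itself")
        # Assumption: a link works in both directions.
        self._links[a].add(b)
        self._links[b].add(a)

    def linked_to(self, sentence_id):
        """Return a copy of the ids directly linked to this sentence."""
        return set(self._links.get(sentence_id, set()))
```

Whether two sentences really share the same meaning, style and type of speech is a human judgment; the sketch only records the result of that judgment.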


  1. A quick clarification/reconsideration request on near duplicate sentences. So far, I've resisted adding near duplicate translations as I don't see how these would add any real value when there are already heaps of search results to wade through.

    Most commonly it's something like a translation of 'they' from English to Icelandic which can be any of 'þeir', 'þær' or 'þau' depending on whether they are all male, all female or a mixture thereof. Other times it's synonyms or all-but identical parts of sentences.

    In these cases I've added comments on these alternatives, looking forward to the time when there may be fields for meta-data (I sent an email to Allan/sysko with this feature request and he noted that this had been discussed in the dev team). I'd be hesitant to, but is it really better to add bunches of near-redundant sentences?

  2. I have added links to a discussion we had on the Wall about this.

    => http://tatoeba.org/eng/wall/show_message/1085

    I'll copy my response here as well:

    Our position is: people can do whatever they like. If they want to add all the possible variations, they can. If they don't want to, they don't have to.

    It doesn't hurt to have "near duplicates". It just makes Tatoeba a bit noisy. But it's our job, as engineers, to figure out how to filter and organize the data so that language learners can use it efficiently.

    Meanwhile, as sysko said, variations of sentences can be very useful for language processing, so we shouldn't delete them.


