Tuesday, January 25, 2011

Legally valid content

This article aims to give general instructions on how to contribute legally valid content in Tatoeba, to minimize the risk of Tatoeba being shut down for having illegal content (not saying it will be happening anytime soon, but better be safe).

If there is one thing you will need to remember, it is this: do not add non CC-BY sentences in Tatoeba.

Non CC-BY sentences

Perhaps "non CC-BY sentence" is a bit cryptic for some of you so let me clarify what it means. CC-BY is a short name for the Creative Commons Attribution license. Tatoeba redistributes all its sentences under this license. A non CC-BY sentence is simply a sentence that is not compatible with the CC-BY license.
  • Anything that is under copyright is NOT compatible with CC-BY (that includes quotes from books, movies, songs...).
  • Anything that is under a license that has a "share alike" condition is NOT compatible with CC-BY. CC-BY-SA is not compatible with CC-BY. That means you can't copy text from Wikipedia into Tatoeba. But CC-BY is compatible with CC-BY-SA, so you may insert sentences from Tatoeba in Wikipedia, or Wikiquote for instance.
  • Anything that is under a license that has a "no commercial use" condition is NOT compatible with CC-BY.
  • Anything that is not under any license is NOT compatible with CC-BY. If there's no license, it means by default that the author doesn't authorize re-use.
  • Anything that basically doesn't say "You can do absolutely whatever you want with this as long as you" is NOT compatible with CC-BY. Update: this last statement was an over-simplification. This has caused confusion so I'm removing it.

CC-BY sentences

But now you may wonder, what IS compatible with the CC-BY license?
  • Anything that is under CC-BY is compatible with CC-BY. Sentences that you add in Tatoeba and that were created by yourself are under CC-BY, because you agreed with the Terms of Use.
  • Anything that is in the public domain is compatible with CC-BY. If the author of a book died more than 100 years ago, then you can pretty much safely consider that the book is in the public domain.
  • Anything that basically says "You can do absolutely whatever you want with this" should be compatible with CC-BY.
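
For those who prefer to see the rules as code, here is a minimal sketch of the two lists above, written as a whitelist check. The license labels and the function name are assumptions made up for this illustration; they are not identifiers used by Tatoeba, and the sketch does not replace checking the actual license of a source.

```python
# Hypothetical license labels, invented for this illustration only.
# Only the categories the article lists as compatible are whitelisted.
COMPATIBLE_WITH_CC_BY = {"CC-BY", "public-domain"}


def is_cc_by_compatible(license_label):
    """Return True only if the label is one the article lists as CC-BY compatible.

    Copyrighted text, "share alike" licenses (e.g. CC-BY-SA) and
    "no commercial use" licenses are rejected because they are not whitelisted.
    """
    # No license at all means, by default, that the author doesn't authorize re-use.
    if license_label is None:
        return False
    return license_label in COMPATIBLE_WITH_CC_BY
```

With this sketch, `is_cc_by_compatible("CC-BY-SA")` and `is_cc_by_compatible(None)` are both rejected, while `is_cc_by_compatible("CC-BY")` is accepted. A whitelist is used on purpose: anything that is not clearly known to be compatible is treated as not compatible.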

The basic rules to contribute legal content

1) If you want to be sure that your sentences are legally valid, do NOT copy-paste from anywhere (especially NOT from textbooks, electronic dictionaries, or other language learning websites), only come up with your own sentences.

2) We delete non CC-BY sentences. Depending on the situation, we may either delete the sentence right away, or give the contributor some time to defend their sentence.

3) Do NOT translate a sentence that you think is non CC-BY. Instead, post a comment to express your doubts about the legal status of the sentence. If you are a trusted user, add the tag "@possibly non CC-BY". If you see other people adding or translating non CC-BY content, tell them NOT to do that.

4) If you do copy-paste from somewhere else, indicate in the comments where you copied the sentence from. Give all the information you can so that we can easily verify that it is indeed CC-BY compatible.

5) We will block a user's ability to contribute (add, translate, edit sentences) if they are not following these rules.

6) To be honest, it can happen that we delete sentences that are legally valid, because the line between legal and illegal is not always clear. If you are a specialist in these legal issues, please help us define a clear method to determine whether a sentence is legally valid or not.

Related links

Here's a bunch of links related to copyright and stuff. I'm just throwing them here for those who are interested in expanding their knowledge on the matter. Wikipedia obviously has a lot of information on the subject since they certainly have to deal with the problem more often than any other collaborative project out there.

Stats for the year 2010

Okay this is my last post about stats for the day (and for a while). So I had already published part of them previously, but since a new year started, I'll republish the stats for the whole year 2010 :)

Number of sentences added per month

Visitors per month

Pageviews per month

Countries with the most visits (top 20)

That's something I didn't mention last time but I figured I'd also give these stats for people who are interested.
  1. United States (40,735)
  2. Japan (36,945)
  3. France (31,433)
  4. Germany (14,879)
  5. Italy (10,316)
  6. United Kingdom (10,088)
  7. China (9,069)
  8. Canada (7,637)
  9. Russia (7,232)
  10. Belgium (6,894)
  11. Poland (5,953)
  12. Spain (5,780)
  13. Philippines (4,680)
  14. Ukraine (4,260)
  15. Brazil (3,924)
  16. Australia (3,911)
  17. Iran (3,424)
  18. Mexico (3,096)
  19. Netherlands (3,069)
  20. India (2,986)
Note that this is the number of visits and not the number of visitors. A visitor can visit a website several times.
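
To make the difference concrete, here is a small sketch. The visit log and the visitor names are made up for the example; they are not taken from our actual analytics.

```python
# Hypothetical visit log: one entry per visit, identified by a made-up visitor name.
visit_log = ["alice", "bob", "alice", "alice", "carol"]


def count_visits(log):
    """Every entry counts, even when the same person comes back."""
    return len(log)


def count_visitors(log):
    """Each distinct person counts once, no matter how often they visit."""
    return len(set(log))
```

In this made-up log there are 5 visits but only 3 visitors, because the same person came back several times. The numbers in the list above are of the first kind.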

Sentences stats (Jan 2011)

Alright so I published the stats for Tatoeba day #2, and it is also the occasion to publish more general stats about how the corpus is progressing.

Languages ranking

Top 5
  • English - 167,000+. That's about 10,000 more than two months ago. CK has added about 2500 sentences from Voice of America.
  • Japanese - 153,000+. Hasn't progressed much ^^
  • Esperanto - ~70,000. It currently indicates over 70,000 in the stats, but there are over 2000 duplicates, so it's not exactly 70,000 yet. However, Esperanto is now the 3rd most important language in Tatoeba! Incredible achievement :)
  • French - 57,000. 4000 new sentences compared to 2 months ago.
  • German - 43,000. 11,000 new sentences compared to 2 months ago. At this rate it will not take long before German outranks French as well :P

Other languages with 10,000+ sentences
  • Spanish - 25,000+. 6000 new sentences compared to 2 months ago, and gained one rank :D
  • Polish - 24,000. Lost its rank to Spanish, but still gained 4000 sentences.
  • Russian - 22,000+. Also gained 4000 sentences, and is still at the same position.
  • Chinese Mandarin - 16,000+. Gained 1000 sentences, still at the same position.
  • Ukrainian - 15,000+. Gained 1000 sentences, also remained at the same position.
  • Italian - 14,000+. Remains at the same position but finally reached the 10,000 milestone :D And gained 5500 sentences since last time.
  • Dutch - 12,000+. Dutch also joined the 10,000 family! Gained one rank and pretty much doubled in quantity.
  • Hungarian - 10,000+. 3rd language to join this category! Very fast progression. Gained 7000 sentences and is ranked 12th while it was ranked 18th last time!
Other languages with 1,000+ sentences
  • Hebrew - 8,000+. Pretty much doubled in quantity as well!
  • Arabic - 7,500+. Has been slowing down. Only 1000 new sentences compared to last time.
  • Portuguese - 7,000.
  • Icelandic - 5,500+.
  • Persian - 5,000+. Persian is new here! Maybe we'll see it in the 10,000 category in a few months :)
  • Danish - 4,500+.
  • Hindi - 3,500.
  • Turkish - 3,300.
  • Uyghur - 3,000.
  • Shanghainese - 2,700.
  • Vietnamese - 2,600.
  • Belarusian - 2,000.
  • Cantonese - 1,700.
  • Norwegian (Bokmål) - 1,600.
  • Lojban - 1,100. Lojban is new here as well!
  • Swedish - 1,000. And Swedish too!

Other numbers
  • We've reached the 700,000 milestone this month! Although we have 6000+ duplicate sentences so it's not really 700,000 yet.
  • We're currently supporting 83 languages.
  • We have 8000+ sentences with audio.

Monday, January 24, 2011

Stats for Tatoeba day #2

The theme for Tatoeba day #2 was quality. For this day we wanted to encourage people to adopt, check, and correct sentences, rather than adding lots of sentences and translations. So here are the stats to get an idea of how much has been done :)


Shortly before the start of Tatoeba day, we updated the site and made available a page that lists sentences without an owner. The number of orphan sentences at the beginning of Tatoeba day was 254779. At the end, it was 252331. So an additional 2448 sentences had a home at the end of the day :)

By language

I'm not going to publish the number of orphan sentences for each language. I'll only show the number of adoptions for each language on Tatoeba day.

Languages which need adoption the most are Japanese (148,000+ orphans), English (89,000+ orphans) and French (13,000+ orphans).
Russian, Vietnamese, Esperanto, Spanish and Dutch need a bit of attention too, but they have a low population of orphans (fewer than a few hundred).

By user

24 users have been adopting.

CK is definitely our most active adopter with 873 adoptions that day. He's like the proof-reading master of English sentences coming from the Tanaka Corpus.
He's followed by szaby78 with 419 adoptions. But szaby78 has adopted all the orphan Hungarian sentences.
Then in 3rd position we have Guybrush88, with 194 adoptions for Italian.


There were 1,370 sentences tagged 'OK' on Tatoeba day. Mostly by CK, for English sentences.
  • CK (1184)
  • LaraCroft (74)
  • Guybrush88 (56)
  • arcticmonkey (48)
  • xtofu80 (3)
  • Zifre (2)
  • fucongcong (1)
  • Pharamp (1)
  • Shishir (1)


A total of 422 sentences were corrected. szaby78 has been the most active in trying to correct sentences.
  • szaby78 (51)
  • Shishir (33)
  • Nero (23)
  • ludoviko (22)
  • jakov (19)
  • qdii (19)
  • zipangu (18)
  • Zifre (17)
  • GilHut (14)
  • CK (11)
  • martinod (10)
  • xtofu80 (10)
  • Eldad (10)
  • Dejo (10)
  • U2FS (9)
  • GrizaLeono (9)
  • Esperantodan (8)
  • Hans07 (8)
  • Archibald (8)
  • Guybrush88 (8)
  • Pharamp (8)
  • LaraCroft (7)
  • Esperantostern (6)
  • nickyeow (5)
  • JimBreen (5)
  • Farkas (5)
  • Riskemulo (4)
  • arcticmonkey (4)
  • ventana (4)
  • sysko (4)
  • Vortarulo (4)
  • esocom (4)
  • landano (4)
  • MUIRIEL (4)
  • ivanov (4)
  • rado (3)
  • kebukebu (2)
  • mamat (2)
  • Alois (2)
  • Muelisto (2)
  • darinmex (2)
  • ismailzali (2)
  • excaelestis (1)
  • shanghainese (1)
  • kolonjano (1)
  • catakaoe (1)
  • brauliobezerra (1)
  • kurteago (1)
  • sigfrido (1)
  • jxan (1)
  • sacredceltic (1)
  • pandark (1)
  • boracasli (1)
  • TRANG (1)
  • pqs (1)
  • pohli (1)
  • autuno (1)
  • manuk7 (1)
  • MikeMolto (1)
  • fucongcong (1)


There were 503 comments posted, almost half of them by CK who was mostly pointing out duplicate sentences.
  • CK (214)
  • arcticmonkey (35)
  • martinod (31)
  • Zifre (20)
  • Shishir (19)
  • Eldad (19)
  • Pharamp (19)
  • ivanov (17)
  • GrizaLeono (15)
  • Dejo (13)
  • U2FS (13)
  • Archibald (11)
  • Nero (10)
  • szaby78 (9)
  • qdii (8)
  • GilHut (7)
  • zipangu (7)
  • jakov (7)
  • Hans07 (7)
  • dziglo (6)
  • LaraCroft (6)
  • Vortarulo (6)
  • Guybrush88 (5)
  • xtofu80 (5)
  • nickyeow (5)
  • landano (5)
  • ludoviko (5)
  • fucongcong (4)
  • sacredceltic (4)
  • pandark (3)
  • Esperantodan (3)
  • sysko (3)
  • darinmex (3)
  • MUIRIEL (3)
  • Farkas (2)
  • ismailzali (2)
  • Swift (2)
  • Muelisto (2)
  • Esperantostern (2)
  • jxan (2)
  • sigfrido (2)
  • brauliobezerra (1)
  • JimBreen (1)
  • samueldora (1)
  • ventana (1)
  • pohli (1)
  • azulhana (1)
  • tuuli (1)
  • BraveSentry (1)
  • rado (1)
  • boracasli (1)
  • rpglover64 (1)

Next Tatoeba day

Our next Tatoeba day is scheduled for February 20th. We chose that date because it's the Sunday right before International Mother Language Day on February 21st :)

We haven't decided yet what the theme will be, but the banners mini-contest deadline is delayed to that date since I received only 3 submissions. In any case I will write another blog post about it when the time comes. Thank you to everyone who contributed to this 2nd Tatoeba day :)

And more general stats in the next posts...

Saturday, January 22, 2011

Tatoeba update (Jan 22nd, 2011)

First update of the year! We're adding a couple of new things that will be useful for Tatoeba day #2, which is starting soon :) I'll also mention a few changes that were made at the end of December, but I didn't feel like writing a post especially for them.

What's new
  • There is a page that lists sentences with audio. [change made in December]
  • The download feature for lists is limited to those that have 50 sentences or less. We had to do that because otherwise it can cause Tatoeba to become unavailable. [change made in December]
  • The "Contribute" section is now divided into several categories: add, translate, adopt, improve, discuss.
  • You cannot add the tag 'OK' on your own sentences; it will refuse to save. It's more useful to let others tag your sentences with 'OK' because the fact that you own a sentence already means you are okay with it.
  • The status of users is now indicated in their profile and contributions page.

What next

We will also include in the "Contribute" section a page where you can enter the IDs of 2 sentences to link or unlink them. This feature will be restricted to trusted users.

Other than that, you may want to learn about how the next version of Tatoeba is progressing here :)

Monday, January 17, 2011

Grant from Mozilla Drumbeat

Quick summary

Before I get to the point, I'll quickly summarize our story with Drumbeat from the beginning :)

Back in May 2010, we found out about Mozilla Drumbeat and felt that we definitely had our place there. For those who haven't heard of it, Drumbeat is a platform that was launched by Mozilla at the beginning of 2010 and aims to gather projects that keep the Web open -- because Mozilla has been working on an open browser all these years and now they want to work on the open Web.

Since it was (and still kinda is) a young platform, we didn't exactly know what we could/should expect from it, but we felt it was worth trying to join. We mostly wanted to make it to the "featured projects", and perhaps even receive a bit of financial support. We found out about their joint fellowship with the Shuttleworth Foundation and figured we could participate, and it was even more of an incentive to join since applying for the fellowship implied creating a project page on Drumbeat. So we did that, following the instructions on how to apply, and then we did our best to get as many people as possible to vote for our project on Drumbeat, so we could reach the top 5 popular projects and get some attention. For those who wonder, I'm not sure what happened to this fellowship project to be honest, because I haven't heard about it since then. Perhaps it was dropped, I don't know. But anyway, besides that, I also had the chance to attend their really awesome and intense events in Paris (in July) and Barcelona (in November), and we were mentioned in one of their newsletters (in August).

Eventually, on December 31st we received really great news: they decided to give the project a $2,500 grant. We were actually told the good news in October, but there wasn't anything really official yet about it and we didn't know how much the project would receive. This time though, it looks official :D So huge, huge thank you to Mozilla Drumbeat! :)

What are we going to do with the money?

So far, we've decided to donate a part of it:
- to the FSF France, which has been hosting us since April 2010.
- to Shtooka, which has been helping us gather audio (today we have almost 8000 sentences with audio).
- to Tokidoki, whose founder hosted Tatoeba prior to the FSF France and gave us ~20,000 French translations of the Tanaka Corpus in the early days of the project.

We'll also start making goodies for our growing fanbase ;) However goodies means designing good stuff to put on a T-shirt or things like that, and we haven't designed anything yet so don't expect the goodies right away. If you're a talented artist and would like to help out, don't hesitate to contact us or come talk to us on IRC (freenode, #tatoeba)!

And we'll probably do a few other things, but we also want to try keeping some of it to pay for transportation and accommodation to attend future events such as the RMLL 2011 :)

Sunday, January 9, 2011

Tatoeba day #2 (Jan 23rd, 2011)

We have decided on a date for our next Tatoeba day, and it will be January 23rd, 2011. Just like the last Tatoeba day, it will start at 0:00 and will end at 23:59 (France time).

There will be 2 objectives for that day:
  • Banners for Tatoeba
  • Quality of the corpus

Banners for Tatoeba

For those who are not sure what I'm talking about, what we call a "banner" is basically an image that represents a website. Let's say you are a fan of Tatoeba and have a personal blog. Because you are very supportive, you would like to put a link to Tatoeba on your blog, and perhaps you would like the link to be graphical rather than simple text. Well, we don't really have any standard image for people to use in such situations and we'd like to create some.

So we're organizing a little contest for our contributors with artistic/design skills (or who just want to give it a try): create banners for Tatoeba!

Everyone can participate and if you want to, here's what to do or to know:
  1. Make 2 images with the following sizes: 88x31 and 392x72. You may re-use our current logo in it, but don't hesitate to make another (better) logo if you are inspired.
  2. Send your 2 images to trang@tatoeba.fr with the title "Tatoeba banners" and indicate in the email your Tatoeba username. I will reply to confirm that I have received them.
  3. The current deadline is January 23rd, 13:00 (France time). However, we will need at the very least 5 submissions, otherwise it's not very interesting :P I do hope there will be more than 5, but if there aren't enough submissions, we will extend the deadline to the next Tatoeba day: February 20th, same time.
  4. Shortly after the deadline I will publish the banners that were sent to me. Then Tatoeba users will have one week to vote for their favorite banners. I'm not sure yet how we will do the votes but I will write about it in due time. IMPORTANT: I don't want people to be influenced by "who made the banner" during the vote so I will not indicate this information when I first publish the banners. I will ask you as well to keep your work "secret" the whole time (don't show it to anyone and don't say "I did this one").
  5. Once the votes are over, I will reveal the participants and announce the winner who will then be venerated forever by everyone for his/her talent :)

Quality of the corpus

Since Tatoeba is open for everyone to contribute, one of its biggest problems is quality. Contributors aren't necessarily professionals and we inevitably have many sentences that contain mistakes or don't sound right. For our 2nd Tatoeba day, we will be focusing on quality. The goal of the day will be to check, correct and improve as many sentences as possible.

We've got plenty of sentences tagged "@Needs Native Check", "@change" and "@check", and it would be really nice to remove as many of these tags as possible and replace them with the 'OK' tag. We've also got plenty of orphan sentences that desperately need parents.

If you want to participate, don't be shy and join our IRC channel #tatoeba on January 23rd (cf. our help page to learn how to use IRC, in case you are not familiar with IRC). This way you can discuss in real time with other members about what to do with a sentence (among other things)!

The next day, I will be publishing the following stats:
  • The number of sentences modified.
  • The number of comments posted.
  • The number of sentences tagged 'OK'.
  • The approximate number of sentences adopted.
I'll be honest though, things might be a little disorganized at first. I don't know yet how many people intend to participate and I don't know yet how we will coordinate with each other to work efficiently together. But this second Tatoeba day will be the occasion to experiment and hopefully figure out something :)

NOTE: You may want to read these articles to learn a bit more about how we handle quality, even though they are not up-to-date anymore.