Showing posts with label en. Show all posts

Saturday, February 19, 2011

Tatoeba update (Feb 19th, 2011)

Second update of the year :)

What's new
  • When browsing the profile, the sentences, the comments, the favorites or the Wall messages of a user, you will see a menu that will make it easier to jump between each of these pages.
  • We've added a page that lists all the sentences of a user, but with the sentence options (translate, adopt, favorite, etc.). This is primarily to make it a bit easier to translate the sentences of a specific user, so you will find this page by browsing the sentences of a user and clicking on "Translate these sentences".
  • We've added pagination for private messages.
  • Last but not least, we've stabilized the language of the interface. If your interface is in Chinese, and you click on a link where the language is set to Esperanto, you shouldn't see your interface change to Esperanto anymore.

What next
  • Improvement of the profile (bigger textarea to edit your description).
  • Link/unlink feature (to be added in the "Contribution" sub-menu).

Tuesday, January 25, 2011

Legally valid content

This article aims to give general instructions on how to contribute legally valid content in Tatoeba, to minimize the risk of Tatoeba being shut down for having illegal content (not saying it will be happening anytime soon, but better be safe).

If there is one thing you will need to remember, it is this: do not add non CC-BY sentences in Tatoeba.


Non CC-BY sentences

Perhaps "non CC-BY sentence" is a bit cryptic for some of you so let me clarify what it means. CC-BY is a short name for the Creative Commons Attribution license. Tatoeba redistributes all its sentences under this license. A non CC-BY sentence is simply a sentence that is not compatible with the CC-BY license.
  • Anything that is under copyright is NOT compatible with CC-BY (that includes quotes from books, movies, songs...).
  • Anything that is under a license that has a "share alike" condition is NOT compatible with CC-BY. CC-BY-SA is not compatible with CC-BY. That means you can't copy text from Wikipedia into Tatoeba. But CC-BY is compatible with CC-BY-SA, so you may insert sentences from Tatoeba in Wikipedia, or Wikiquote for instance.
  • Anything that is under a license that has a "no commercial use" condition is NOT compatible with CC-BY.
  • Anything that is not under any license is NOT compatible with CC-BY. If there's no license, it means by default that the author doesn't authorize re-use.
  • Anything that basically doesn't say "You can do absolutely whatever you want with this as long as you" is NOT compatible with CC-BY. Update: this last statement was an over-simplification. It has caused confusion so I'm removing it.


CC-BY sentences

But now you may wonder, what IS compatible with the CC-BY license?
  • Anything that is under CC-BY is compatible with CC-BY. Sentences that you add in Tatoeba and that were created by yourself are under CC-BY, because you agreed with the Terms of Use.
  • Anything that is in the public domain is compatible with CC-BY. If the author of a book died more than 100 years ago, then you can pretty much safely consider that the book is in the public domain.
  • Anything that basically says "You can do absolutely whatever you want with this" should be compatible with CC-BY.
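
For those who like to see rules written down as code, the two lists above can be summed up in a small lookup. This is only a rough sketch: the labels (`"CC-BY"`, `"CC-BY-SA"` and so on) and the function name are made up for the illustration, and it does not replace actually checking the license of your source.

```python
# Rough sketch of the compatibility rules described in this article.
# The labels are hypothetical shorthand, not an official list of licenses.
CC_BY_COMPATIBILITY = {
    "CC-BY": True,           # same license as Tatoeba
    "public-domain": True,   # e.g. a book whose author died long ago
    "copyrighted": False,    # quotes from books, movies, songs...
    "CC-BY-SA": False,       # has a "share alike" condition
    "CC-BY-NC": False,       # has a "no commercial use" condition
    "no-license": False,     # no license means no permission to re-use
}


def can_add_to_tatoeba(license_label):
    """Return True only if the label is known to be compatible with CC-BY."""
    # Unknown labels are refused, in the spirit of "better be safe".
    return CC_BY_COMPATIBILITY.get(license_label, False)


print(can_add_to_tatoeba("CC-BY"))     # True
print(can_add_to_tatoeba("CC-BY-SA"))  # False
```

Refusing anything unknown is a deliberate choice here: when in doubt, the safe answer is not to add the sentence.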


The basic rules to contribute legal content

1) If you want to be sure that your sentences are legally valid, do NOT copy-paste from anywhere (especially NOT from textbooks, electronic dictionaries, or other language learning websites). Only come up with your own sentences.

2) We delete non CC-BY sentences. Depending on the situation, we may either delete the sentence right away, or give the contributor a delay to defend their sentence.

3) Do NOT translate a sentence that you think is non CC-BY. Instead, post a comment to express your doubts about the legal status of the sentence. If you are a trusted user, add the tag "@possibly non CC-BY". If you see other people adding or translating non CC-BY content, tell them NOT to do that.

4) If you do copy-paste from somewhere else, indicate in the comments where you copy-paste from. Give all the information you can so that we can easily find out it is indeed CC-BY compatible.

5) We will block a user's possibility to contribute (add, translate, edit sentences) if they are not following these rules.

6) To be honest, it can happen that we delete sentences that are legally valid, because the limit between legal vs non-legal is not always clear. If you are a specialist about these legal issues, please help us define a clear method to determine whether a sentence is legally valid or not.


Related links

Here's a bunch of links related to copyright and stuff. I'm just throwing them here for those who are interested in expanding their knowledge on the matter. Wikipedia obviously has a lot of information on the subject since they have to deal with the problem certainly more often than any other collaborative project out there.

Stats for the year 2010

Okay, this is my last post about stats for the day (and for a while). I had already published part of them previously, but since a new year has started, I'll republish the stats for the whole year 2010 :)

Number of sentences added per month


Visitors per month


Pageviews per month


Countries with the most visits (top 20)

That's something I didn't mention last time but I figured I'd also give these stats for people who are interested.
  1. United States (40,735)
  2. Japan (36,945)
  3. France (31,433)
  4. Germany (14,879)
  5. Italy (10,316)
  6. United Kingdom (10,088)
  7. China (9,069)
  8. Canada (7,637)
  9. Russia (7,232)
  10. Belgium (6,894)
  11. Poland (5,953)
  12. Spain (5,780)
  13. Philippines (4,680)
  14. Ukraine (4,260)
  15. Brazil (3,924)
  16. Australia (3,911)
  17. Iran (3,424)
  18. Mexico (3,096)
  19. Netherlands (3,069)
  20. India (2,986)
Note that this is the number of visits and not the number of visitors. A visitor can visit a website several times.

Sentences stats (Jan 2011)

Alright so I published the stats for Tatoeba day #2, and it is also the occasion to publish more general stats about how the corpus is progressing.

Languages ranking

Top 5
  • English - 167,000+. That's about 10,000 more than two months ago. CK has been adding about 2500 sentences from Voice of America.
  • Japanese - 153,000+. Hasn't progressed much ^^
  • Esperanto - ~70,000. It currently indicates over 70,000 in the stats, but there are over 2000 duplicates, so it's not exactly 70,000 yet. However, Esperanto is now the 3rd most important language in Tatoeba! Incredible achievement :)
  • French - 57,000. 4000 new sentences compared to 2 months ago.
  • German - 43,000. 11,000 new sentences compared to 2 months ago. At this rate it will not take long before German outranks French as well :P

Other languages with 10,000+ sentences
  • Spanish - 25,000+. 6000 new sentences compared to 2 months ago, and gained one rank :D
  • Polish - 24,000. Lost its rank to Spanish, but still gained 4000 sentences.
  • Russian - 22,000+. Also gained 4000 sentences, and is still at the same position.
  • Chinese Mandarin - 16,000+. Gained 1000 sentences, still at the same position.
  • Ukrainian - 15,000+. Gained 1000 sentences, also remained at the same position.
  • Italian - 14,000+. Remains at the same position but finally reached the 10,000 milestone :D And gained 5500 sentences since last time.
  • Dutch - 12,000+. Dutch also joined the 10,000 family! Gained one rank and pretty much doubled in quantity.
  • Hungarian - 10,000+. 3rd language to join this category! Very fast progression. Gained 7000 sentences and is ranked 12th while it was ranked 18th last time!
Other languages with 1,000+ sentences
  • Hebrew - 8,000+. Pretty much doubled in quantity as well!
  • Arabic - 7,500+. Has been slowing down. Only 1000 new sentences compared to last time.
  • Portuguese - 7,000.
  • Icelandic - 5,500+.
  • Persian - 5,000+. Persian is new here! Maybe we'll see it in the 10,000 category in a few months :)
  • Danish - 4,500+.
  • Hindi - 3,500.
  • Turkish - 3,300.
  • Uyghur - 3,000.
  • Shanghainese - 2,700.
  • Vietnamese - 2,600.
  • Belarusian - 2,000.
  • Cantonese - 1,700.
  • Norwegian (Bokmål) - 1,600.
  • Lojban - 1,100. Lojban is new here as well!
  • Swedish - 1,000. And Swedish too!

Other numbers
  • We've reached the 700,000 milestone this month! Although we have 6000+ duplicate sentences, so it's not really 700,000 yet.
  • We're currently supporting 83 languages.
  • We have 8000+ sentences with audio.

Monday, January 24, 2011

Stats for Tatoeba day #2

The theme for Tatoeba day #2 was quality. For this day we wanted to encourage people to adopt, check and correct sentences, rather than add lots of sentences and translations. So here are the stats to get an idea of how much has been done :)

Adoptions

Shortly before the start of Tatoeba day, we updated the site and made available a page that lists sentences without an owner. The number of orphan sentences at the beginning of Tatoeba day was 254779. At the end, it was 252331. So an additional 2448 sentences had a home at the end of the day :)
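
As a quick check of the arithmetic above, here is a tiny sketch using only the two orphan counts given in this post (the variable names are just for the illustration):

```python
# Orphan sentence counts quoted in this post.
orphans_at_start = 254779
orphans_at_end = 252331

# Net number of sentences that found an owner during the day.
adopted = orphans_at_start - orphans_at_end
print(adopted)  # 2448

# The same figure as a share of the orphans we started with.
share_percent = adopted / orphans_at_start * 100
print(round(share_percent, 2))  # 0.96
```

So the day covered just under 1% of the orphan sentences.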

By language

I'm not going to publish the number of orphan sentences for each language. I'll only show the number of adoptions for each language on Tatoeba day.


Languages which need adoption the most are Japanese (148,000+ orphans), English (89,000+ orphans) and French (13,000+ orphans).
Russian, Vietnamese, Esperanto, Spanish and Dutch need a bit of attention too, but they have a low population of orphans (no more than a few hundred each).

By user

24 users have been adopting.



CK is definitely our most active adopter with 873 adoptions that day. He's like the proof-reading master of English sentences coming from the Tanaka Corpus.
He's followed by szaby78 with 419 adoptions. But szaby78 has adopted all the orphan Hungarian sentences.
Then in 3rd position we have Guybrush88, with 194 adoptions for Italian.


Validations

There were 1,370 sentences tagged 'OK' on Tatoeba day. Mostly by CK, for English sentences.
  • CK (1184)
  • LaraCroft (74)
  • Guybrush88 (56)
  • arcticmonkey (48)
  • xtofu80 (3)
  • Zifre (2)
  • fucongcong (1)
  • Pharamp (1)
  • Shishir (1)


Corrections

There has been a total of 422 sentences corrected. Szaby78 has been the most active in trying to correct sentences.
  • szaby78 (51)
  • Shishir (33)
  • Nero (23)
  • ludoviko (22)
  • jakov (19)
  • qdii (19)
  • zipangu (18)
  • Zifre (17)
  • GilHut (14)
  • CK (11)
  • martinod (10)
  • xtofu80 (10)
  • Eldad (10)
  • Dejo (10)
  • U2FS (9)
  • GrizaLeono (9)
  • Esperantodan (8)
  • Hans07 (8)
  • Archibald (8)
  • Guybrush88 (8)
  • Pharamp (8)
  • LaraCroft (7)
  • Esperantostern (6)
  • nickyeow (5)
  • JimBreen (5)
  • Farkas (5)
  • Riskemulo (4)
  • arcticmonkey (4)
  • ventana (4)
  • sysko (4)
  • Vortarulo (4)
  • esocom (4)
  • landano (4)
  • MUIRIEL (4)
  • ivanov (4)
  • rado (3)
  • kebukebu (2)
  • mamat (2)
  • Alois (2)
  • Muelisto (2)
  • darinmex (2)
  • ismailzali (2)
  • excaelestis (1)
  • shanghainese (1)
  • kolonjano (1)
  • catakaoe (1)
  • brauliobezerra (1)
  • kurteago (1)
  • sigfrido (1)
  • jxan (1)
  • sacredceltic (1)
  • pandark (1)
  • boracasli (1)
  • TRANG (1)
  • pqs (1)
  • pohli (1)
  • autuno (1)
  • manuk7 (1)
  • MikeMolto (1)
  • fucongcong (1)

Comments

There have been 503 comments posted, almost half of them by CK, who was mostly pointing out duplicate sentences.
  • CK (214)
  • arcticmonkey (35)
  • martinod (31)
  • Zifre (20)
  • Shishir (19)
  • Eldad (19)
  • Pharamp (19)
  • ivanov (17)
  • GrizaLeono (15)
  • Dejo (13)
  • U2FS (13)
  • Archibald (11)
  • Nero (10)
  • szaby78 (9)
  • qdii (8)
  • GilHut (7)
  • zipangu (7)
  • jakov (7)
  • Hans07 (7)
  • dziglo (6)
  • LaraCroft (6)
  • Vortarulo (6)
  • Guybrush88 (5)
  • xtofu80 (5)
  • nickyeow (5)
  • landano (5)
  • ludoviko (5)
  • fucongcong (4)
  • sacredceltic (4)
  • pandark (3)
  • Esperantodan (3)
  • sysko (3)
  • darinmex (3)
  • MUIRIEL (3)
  • Farkas (2)
  • ismailzali (2)
  • Swift (2)
  • Muelisto (2)
  • Esperantostern (2)
  • jxan (2)
  • sigfrido (2)
  • brauliobezerra (1)
  • JimBreen (1)
  • samueldora (1)
  • ventana (1)
  • pohli (1)
  • azulhana (1)
  • tuuli (1)
  • BraveSentry (1)
  • rado (1)
  • boracasli (1)
  • rpglover64 (1)

Next Tatoeba day

Our next Tatoeba day is scheduled on February 20th. We chose that date because it's the Sunday right before International Mother Language Day on February 21st :)

We haven't decided yet what the theme will be, but the banners mini-contest deadline is delayed to that date since I received only 3 submissions. In any case I will write another blog post about it when the time comes. Thank you to everyone who contributed to this 2nd Tatoeba day :)

And more general stats in the next posts...

Saturday, January 22, 2011

Tatoeba update (Jan 22nd, 2011)

First update of the year! We're adding a couple of new things that will be useful for Tatoeba day #2, which is starting soon :) I'll also mention a few changes that were made at the end of December, but I didn't feel like writing a post especially for them.

What's new
  • There is a page that lists sentences with audio. [change made in December]
  • The download feature for lists is limited to those that have 50 sentences or less. We had to do that because otherwise it can cause Tatoeba to be unavailable. [change made in December]
  • The "Contribute" section is now divided into several categories: add, translate, adopt, improve, discuss.
  • You cannot add the tag 'OK' on your own sentences; it will refuse to save. It's more useful to let others tag your sentences with 'OK' because the fact that you own a sentence already means you are okay with it.
  • The status of users is now indicated in their profile and contributions page.
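
To illustrate the rule about the 'OK' tag in the list above, here is a minimal sketch of such a check. It is not the actual Tatoeba code; the function name and its parameters are made up for the example.

```python
def can_save_tag(tag_name, tagger, sentence_owner):
    """Return False when a user tries to tag their own sentence with 'OK'."""
    # Owning a sentence already means you are okay with it,
    # so an 'OK' tag only makes sense when it comes from someone else.
    if tag_name == "OK" and tagger == sentence_owner:
        return False
    return True


print(can_save_tag("OK", "alice", "alice"))  # False
print(can_save_tag("OK", "bob", "alice"))    # True
```

Any other tag goes through this check regardless of who owns the sentence.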

What next

We will also include in the "Contribute" section a page where you can enter the id's of 2 sentences to link or unlink them. This feature will be restricted to trusted users.

Other than that, you may want to learn about how the next version of Tatoeba is progressing here :)

Friday, December 10, 2010

Tatoeba update (Dec 10th, 2010)

What's new

Sentences stats. There's now a specific page for the sentences stats, to make them a bit more readable. The total number of sentences is also now indicated (it's quite an important number, but for some reason we never displayed it anywhere).

Wall messages of a user. You can browse the messages that were posted by a specific user, from the user profile. Click on "See this user's contribution" and scroll to the bottom of the page. You will see the latest messages posted by the user, and a link to view them all (if the user has posted any message).

Sentences tagged more than 2 weeks ago. That's useful mostly for moderators :)

New languages. We've added Ainu, Malayalam, Low German and Sicilian.

FAQ. In case you haven't noticed, the procedure to request a new language was updated (several weeks ago), and we added a new question, regarding audio.


What next
  • Improvement of the profile page. Because the way one can edit his profile at the moment is not really the most intuitive, nor practical.
  • Very certainly other things but I can't tell what yet because it will depend on my inspiration...

Sunday, November 21, 2010

Tatoeba update (Nov 21st, 2010)

Alright, it's been a long time since we last updated Tatoeba :) This is just a small update.

What's new

"Members" page. This is probably the main modification. We redesigned the "Members" page a little bit so that it looks better and is less slow. We removed the information about the last login, because some people don't like being spied on :P We removed the top 20 ranking because that's what made the page so slow. Instead we're displaying the members who are currently active (those who took part in the last few hundred contributions).

Tags info. If you hover your mouse over a tag, you will see the id of the user who added it, and the date when it was added. This is mostly useful for sentence owners, who may wonder why someone has tagged a sentence a certain way. You can figure out who's the user behind a certain id with the following URL: http://tatoeba.org/users/show/[id].
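
If you look up ids often, building that URL can be sketched in a couple of lines. The function name is made up and the id 42 is only a placeholder, not a real example.

```python
def user_profile_url(user_id):
    """Build the URL that shows the user behind a numeric id."""
    # int() rejects anything that is not a plain number.
    return "http://tatoeba.org/users/show/{}".format(int(user_id))


print(user_profile_url(42))  # http://tatoeba.org/users/show/42
```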

Set language to "unknown". We get requests for new languages quite frequently and we ask people to add a few sentences in the language they request. Except that the language is sometimes misdetected and there was no way to set the language to "unknown" (to indicate that it's a language that is not in the list). Now it's possible. There is an option called "other language", which will set the language icon to "unknown".

Sentence owner's name in comments. It was requested a long time ago, and it's finally here. The name of the sentence owner is now indicated in the comments, next to the sentence itself. This way, when you look at a comment on the homepage, you will not only know what sentence it is associated to, but also the user who added that sentence.


What next
  • We'll be working on a page that lists all sentences that were tagged @change and @delete more than 2 weeks ago. This way moderators will have a simple way to know what sentences they can/should take care of.
  • We'll be adding a page that lists all the Wall messages of a user.
  • And perhaps other random things...

Sunday, November 14, 2010

Tatoeba day & stats

Yesterday was our first Tatoeba day, so today I'm publishing stats about what was achieved that day, as well as more general stats.


Stats by language

The chart below shows the number of sentences added on Nov 13th for each language.


The gold medal goes to Arabic! Silver goes to Esperanto and bronze goes to German :)
  1. Arabic (573)
  2. Esperanto (354)
  3. German (247)
  4. Egyptian Arabic (230)
  5. Spanish (207)
  6. Italian (183)
  7. Chinese Mandarin (162)
  8. Hebrew (125)
  9. French (113)
  10. Ukrainian (105)
  11. Danish (100)
  12. Hungarian (78)
  13. Cantonese (78)
  14. English (73)
  15. Russian (70)
  16. Polish (45)
  17. Dutch (36)
  18. Old East Slavic (33)
  19. Lithuanian (18)
  20. Persian (17)
  21. Unknown language (10)
  22. Portuguese (8)
  23. Finnish (7)
  24. Latvian (4)
  25. Vietnamese (4)
  26. Czech (3)
  27. Swedish (3)
  28. Norwegian Bokmål (2)
  29. Shanghainese (2)
  30. Breton (1)
  31. Bulgarian (1)
  32. Catalan (1)
  33. Estonian (1)
  34. Japanese (1)
  35. Quechua (1)
  36. Slovak (1)
  37. Turkish (1)
  38. Uzbek (1)
Sadly, the record set on August 18th of 3465 sentences added was not broken. We only made it to 2899. It's still not bad though, since it's the 2nd most important day, in terms of sentences added (and by "sentences added" I mean "new sentences + translations").

We were missing a few of our devoted members that day, so I guess it's normal. Let's hope more people will be available for the next Tatoeba day :)


Stats by users

The chart below shows the number of sentences added (in green) and the number of sentences modified (in yellow) on Nov 13th, for the top 20 users. You'll excuse my laziness but I only used the number of sentences added for the rank.

Saeb wins the day, by far, with 802 sentences added! Congrats :D Second place goes to nickyeow, and third place goes to Eldad.

At any rate, everyone deserves a big thank you for their contributions! THANK YOU :)

  1. saeb (802/20)
  2. nickyeow (214/20)
  3. Eldad (166/17)
  4. aandrusiak (140/7)
  5. MUIRIEL (138/41)
  6. Guybrush88 (135/2)
  7. danepo (100/12)
  8. GrizaLeono (94/21)
  9. Shishir (94/12)
  10. Dejo (56/11)
  11. Archibald (54/32)
  12. darinmex (53/5)
  13. rado (52/2)
  14. Leono (51/10)
  15. esocom (51/4)
  16. Esperantostern (48/5)
  17. Muelisto (43/1)
  18. kroko (42/4)
  19. Dorenda (41/0)
  20. qdii (40/11)
  21. zipangu (37/2)
  22. wondersz1 (33/4)
  23. Manfredo (27/1)
  24. samueldora (24/2)
  25. sysko (23/7)
  26. szaby78 (22/5)
  27. Zifre (22/7)
  28. cost (21/2)
  29. sencay (20/2)
  30. shanghainese (19/0)
  31. fanty (18/0)
  32. pliiganto (16/13)
  33. BraveSentry (15/1)
  34. pjer (14/5)
  35. U2FS (14/3)
  36. debian2007 (13/1)
  37. Gyuri (12/3)
  38. jxan (12/0)
  39. virgil (12/4)
  40. TRANG (11/32)
  41. slavneui (11/0)
  42. sarah (11/0)
  43. kebukebu (10/2)
  44. Wimmer (10/1)
  45. ae5s (10/0)
  46. Tonari (9/0)
  47. arashi_29 (9/5)
  48. Aleksej (7/0)
  49. CK (5/14)
  50. Shoyren (4/1)
  51. Holyspirit (3/0)
  52. JimBreen (2/0)
  53. luwenzhuo (2/0)
  54. CLARET (2/1)
  55. lajauge (1/0)
  56. ozma29 (1/0)
  57. sschlumberger (1/0)
  58. mr5 (1/0)
  59. Tenshi (1/0)

Language ranks

Tatoeba day is a good occasion to see how each language has progressed. You can see how each language with more than 1000 sentences was positioned one month ago, in this previous post. Let's see how it is now...

Top 5

The top 5 hasn't changed.
  1. English - 158,000+. It looks like English has been growing a little bit.
  2. Japanese - 153,000+. Japanese is standing still. You can tell we don't have a very strong Japanese community.
  3. French - 53,000+. French keeps moving at a steady pace.
  4. Esperanto - 47,000+. Esperanto is catching up with French quickly...
  5. German - 32,000+. German is progressing better than French, but still not quite as well as Esperanto.
Other languages with 10,000+ sentences
  • Polish - 20,000+
  • Spanish - almost 19,000. Spanish gained one rank! :D
  • Russian - almost 18,000
  • Chinese Mandarin - almost 15,000
  • Ukrainian - 14,000+
Other languages with 1,000+ sentences
  • Italian - 8,500+
  • Arabic - 6,500+. Great boost for Arabic!
  • Dutch - almost 6,500
  • Portuguese - 6,000+
  • Hebrew - 4,500+. Great boost for Hebrew as well!
  • Icelandic - 4,000+
  • Hindi - almost 3,500
  • Hungarian - 3,000+. Hungarian joined the 1,000+ sentences club! Very good progress.
  • Turkish - 2,500+
  • Shanghainese - 2,500+
  • Uyghur - almost 2,500
  • Danish - 2,000+. Danish is new to the club with very good progress as well!
  • Vietnamese - 2,000+
  • Belarusian - almost 2,000
  • Norwegian Bokmål - 1,500+
  • Cantonese - 1,500+

Other numbers
  • 55,735 sentences added in October.
  • About 25,000 sentences added since the beginning of November.
  • We've reached 600,000 sentences in total today!
  • But there are probably thousands of duplicates, so it's not really 600,000 yet...
  • We will soon have 76 languages. 5 are waiting to be added: Galician, Irish, Interlingua, Lojban, Toki Pona. Note that the last 3 languages are constructed languages.

Next Tatoeba day

A potential date for the Tatoeba day would be December 11th. Although it could be December 18th as well. We'll see what suits best for everyone.

The main objective of the first Tatoeba day was to break the record of the highest number of sentences added in one day. We didn't break it, but it's okay because we still had fun :D

The main objective will be different for the second Tatoeba day. We haven't decided what it will be yet, but I think it would be nice to emphasize adoption next time, because unfortunately I didn't really have time to look at adoptions for this first Tatoeba day :(

Anyway, we'll keep you informed. Thanks again to everyone who participated and who came to our IRC channel :)

Sunday, November 7, 2010

Tags guidelines

We introduced the "tags" feature several months ago and we've let trusted users experiment with it pretty much freely. A profusion of tags has been created, but they are quite a mess, so we decided to try tidying up.

From now on, if you are going to tag a sentence, please take into consideration the following things.


1. Use tags for objective and official information

We would like to keep the tags for "objective" and "official" information. If you want to categorize sentences for personal purpose, you should use lists.

For instance, you cannot tag a sentence "French exam" to mark the sentence as part of those you will use to practice before your French exam, you should create a list for that. We know lists are not as practical as tags, but we'll be improving the lists feature as soon as we have time.


2. Avoid creating new tags

Avoid creating new tags because it can make the cleaning process harder. If the tag you want to add doesn't appear in the autocompletion list, then it's a new tag, so don't add it unless you are really convinced it's a valid tag.


3. Ask before you create a new tag

We don't have clear rules yet for what is a valid tag and what is not, but one of our moderators (Swift) volunteered to take care of the tags. If you feel the need to create a new tag, it would be wise to ask Swift first. He will be officially in charge of tidying up the tags. He will be the one deciding what tag to keep or not and what tag to rename. Also, don't hesitate to contact him if you would like to help out. It's not easy to decide on these things.


4. Use English for tags, unless you really can't

We have decided to use English as the default language for tags. We will rename all non-English tags into their English equivalent, when it is possible. We can still accept non-English tags, but only if there is no English equivalent.

The point of having one common language is uniformity. It would be inefficient to have a bunch of sentences tagged "proverb" (English) and another bunch tagged "proverbe" (French). There is also no point having a sentence tagged with both "proverb" and "proverbe". They are the same notion. It can even make things confusing to have several tags designating the same notion; that's why we have decided to have one default language. We will later implement the possibility to translate the tags and to display them in languages other than English.
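
To give an idea of what renaming non-English tags amounts to, here is a small sketch. It is only an illustration, not how the cleanup is actually done: the mapping and the function names are made up, and the only pair taken from this post is "proverbe" and "proverb".

```python
# Hypothetical mapping from non-English tags to their English equivalent.
# Only the "proverbe" -> "proverb" pair comes from this post.
ENGLISH_EQUIVALENTS = {"proverbe": "proverb"}


def normalize_tag(tag):
    """Return the English equivalent of a tag, or the tag itself if none is known."""
    cleaned = tag.strip()
    return ENGLISH_EQUIVALENTS.get(cleaned.lower(), cleaned)


def normalize_tags(tags):
    """Normalize every tag of a sentence and drop the resulting duplicates."""
    result = []
    for tag in tags:
        name = normalize_tag(tag)
        if name not in result:
            result.append(name)
    return result


print(normalize_tags(["proverb", "proverbe"]))  # ['proverb']
```

A tag with no English equivalent in the mapping is left untouched, which matches the idea that non-English tags are still accepted when nothing else fits.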


5. How things are going to work
  • We'll try to keep the process as transparent as possible.
  • Swift will publish on the Wall the modifications that will be applied to the tags (i.e. renaming and deletions).
  • There will be a few days until these modifications are actually applied, in case people strongly disagree with a decision.
  • Swift will also add on his profile and his personal web page the links to every Wall post mentioning the modifications, for people to be able to trace back all the decisions about the tags.
  • If you need to protest against a decision, please refer to Swift.

Thursday, October 14, 2010

Some stats

I normally tweet whenever a language reaches an important milestone, but I was a bit absent from Tatoeba the past 4 weeks and I didn't really keep track of the progress of each language. So I'm going to sum up everything in this blog post and, while I'm at it, give more general stats about Tatoeba.


New languages

We've added several new languages since September. Tatoeba is now supporting a total of 71 languages. The new languages are:
  • Bosnian
  • Croatian
  • Old East Slavic
  • Chamorro
  • Tagalog
  • Quechua
  • Mongolian
  • Lithuanian

Sentences stats

Top 5
  1. English - 156,000+ sentences. English has taken the first place back in September and things still haven't changed.
  2. Japanese - 153,000+ sentences.
  3. French - 50,000+ sentences. Around 10,000 sentences were added within 2 months. There's progress :) It had taken 3 months to go from 30,000 to 40,000.
  4. Esperanto - 32,000+ sentences. That's 20,000 sentences within 5 weeks. It's really good! It looks like Esperanto will soon outrank French...
  5. German - 27,000+ sentences. German lost its 4th position to Esperanto a few weeks ago.
Other languages with 10,000+ sentences
  • Polish - 16,000+ sentences.
  • Russian - 15,000+ sentences.
  • Spanish - 14,000+ sentences.
  • Chinese (Mandarin) - 14,000+ sentences.
  • Ukrainian - 13,000+ sentences.
Other languages with 1000+ sentences
  • Italian - 7000+ sentences.
  • Dutch - 5500+ sentences.
  • Portuguese - 5500+ sentences.
  • Arabic - 5500+ sentences.
  • Icelandic - 4000+ sentences.
  • Hindi - 3000+ sentences.
  • Shanghainese - 2500+ sentences.
  • Uyghur - 2000+ sentences.
  • Turkish - almost 2000 sentences.
  • Vietnamese - 1500+ sentences.
  • Norwegian (Bokmål) - 1500+ sentences.
  • Belarusian - 1500+ sentences.
  • Hebrew - 1000+ sentences.
  • Cantonese - 1000+ sentences.
Number of new sentences per month

Here's the number of new sentences we've had each month, since the beginning of the year.
  1. January - 9,348 sentences.
  2. February - 12,699 sentences.
  3. March - 6,218 sentences.
  4. April - 10,321 sentences.
  5. May - 12,078 sentences.
  6. June - 19,484 sentences.
  7. July - 30,257 sentences.
  8. August - 44,782 sentences.
  9. September - 49,148 sentences.
By the way, at the moment we have almost 550,000 sentences in total. We can safely expect to reach 600,000 sentences before 2011. But when will we reach 1 million...?


Visitors and pageviews

Maybe some of you are curious to know how many people visit Tatoeba. Here are some stats for each month since the beginning of the year (provided by Google Analytics).

Unique visitors
  1. January - 5,601 visitors.
  2. February - 7,016 visitors.
  3. March - 7,910 visitors.
  4. April - 7,742 visitors.
  5. May - 7,061 visitors.
  6. June - 8,681 visitors.
  7. July - 12,835 visitors.
  8. August - 12,189 visitors.
  9. September - 22,334 visitors.

Pageviews
  1. January - 50,670 pageviews.
  2. February - 52,888 pageviews.
  3. March - 57,093 pageviews.
  4. April - 85,765 pageviews.
  5. May - 103,089 pageviews.
  6. June - 150,297 pageviews.
  7. July - 211,997 pageviews.
  8. August - 275,796 pageviews.
  9. September - 369,025 pageviews.

Sunday, September 26, 2010

Warning: you are being disrespectful

Translations of this article:


I decided to write more specific guidelines about how to react to bad behavior because I'm so fricken tired of seeing people attacking each other in public.

The community is growing and becoming more diverse. Diversity means divergence of opinions, which means more intense debates. I can accept divergence of opinions: it's normal, it's even necessary. But I cannot accept people flaming each other in public. I don't expect members to act all lovey-dovey with each other, but I do expect members to make an effort to be respectful with each other, NO MATTER WHAT.


If you think a user is being disrespectful
  1. Send him a private message with the title "Warning: you are being disrespectful". I insist very much on PRIVATE MESSAGE. Everyone can send this warning, not just moderators.
  2. Add in your private message the link to the comment where the user was disrespectful. I insist again: PRIVATE MESSAGE.
  3. Quote the part of the comment that you felt was disrespectful.
  4. Try to explain why you felt it was disrespectful.
  5. Add a link to this blog post.
Just in case it was not clear, I will repeat the main idea: if you think a user is being disrespectful, send him a private message and ONLY a private message.


If you received warnings
  1. It's possible that it was a misunderstanding from the sender, in which case you can simply explain to him what you really meant. But if one person misunderstood, it's possible that other people will misunderstand you as well, so you should consider clarifying your comment for everyone.
  2. It's possible that you are really being disrespectful, in which case you should consider deleting your comment or apologizing for being disrespectful (or both).
NOTE: Moderators cannot delete other people's comments. Don't count on them to censor you.


What do I find disrespectful?
  • Insulting someone is disrespectful, obviously. I don't think I need to explain that one.
  • Being condescending is disrespectful. You should treat everyone's opinion equally. It shouldn't matter whether you're debating with a 6 year-old kid or a non native speaker. You are NOT entitled to trash someone's opinions just because you think you know better. If you know better, then educate people, don't trash them.
  • Lecturing someone publicly is disrespectful. You can tell someone how they should behave in PRIVATE, but not in public, never EVER. Even something small like "Dude, calm down" => PRIVATE MESSAGE.
  • Generally speaking, writing negative comments about someone is disrespectful. If you don't like something about someone, you let them know in PRIVATE and ONLY IN PRIVATE.
Just to be clear, I may myself show lack of respect in moments of weakness. Everyone may. You come back tired from a long day of work, someone offends you publicly, you can't resist the temptation to reply back publicly as well. It happens to everyone. But it is NOT acceptable, there is NO EXCUSE for that.


What happens to people who misbehave?

My thoughts here about bad behavior are still true today. People who misbehave will not be banned, suspended or anything. They will simply receive a lot of warnings and hopefully those warnings can slap some sense into them. I count on EVERYONE to send warnings to users who are crossing the line. It's not only my job, it's not only moderators' job, it's not only trusted users' job, it's EVERYONE'S JOB to make sure Tatoeba remains a place that people ENJOY going back to.

If your inbox starts being filled with warning messages, you really need to work on your behavior. I must remind you that this is a collaborative project, and collaborative means we are working WITH each other, NOT AGAINST. If you care about this project, then please, show more maturity. If you can't do that, then for Tatoeba's sake, take a break and come back when you grow up. Thank you.

Wednesday, August 25, 2010

Tatoeba update (August 25th, 2010)

Small update.

What's new
  • Autocompletion for tags. NOTE: tags are still available only to trusted users so this feature will only affect them.
  • Tags are organized by popularity. The number of sentences tagged is also indicated.
  • Changed the top menu a little. There's a sub-menu for the "Browse" section, to easily browse by language, by list and by tags.
  • Changed the position of the search input, for better usability.

What next

I can tell you we're preparing a shiny new version of Tatoeba. When it will be ready is still unknown, but it's definitely not for tomorrow. It will take several months.
If you're interested in beta testing it, feel free to drop us an email at team@tatoeba.fr, with the title "Tatoeba beta testing". We'll get back to you when the time comes :)

Saturday, August 7, 2010

Tatoeba update (August 7th, 2010)

This post talks about changes that were applied on July 26th, in addition to those done on August 7th.

What's new?
  • We are now displaying furigana for Japanese sentences. It previously looked like this: 私[わたし] に あいさつ する よう な 人[ひと] は い ない ...which was not very practical to read.
  • We have added a filter in the comments section. You can now display only comments that are posted on sentences in a certain language (for instance, only comments on Esperanto sentences)
  • We add new languages regularly, but this week, we're adding quite a special language: CycL. This was requested by our member witbrock. I'm very curious to see where this is going to lead...
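
For readers curious about the bracket notation mentioned in the furigana item above, here is a minimal Python sketch that turns `word[reading]` tokens into HTML `<ruby>` markup. It is only an illustration: the function name `brackets_to_ruby` is hypothetical, it assumes every annotated word is a whitespace-delimited token immediately followed by its reading in square brackets, and it is not necessarily how Tatoeba itself renders furigana.

```python
import re

# Assumption: an annotated word is a run of non-space, non-bracket characters
# immediately followed by its reading in square brackets, e.g. 私[わたし].
ANNOTATED_WORD = re.compile(r"([^\s\[\]]+)\[([^\]]+)\]")


def brackets_to_ruby(text):
    """Replace word[reading] with <ruby>word<rt>reading</rt></ruby>."""
    return ANNOTATED_WORD.sub(r"<ruby>\1<rt>\2</rt></ruby>", text)


if __name__ == "__main__":
    print(brackets_to_ruby("私[わたし] に あいさつ する よう な 人[ひと] は い ない"))
```

Words without a bracketed reading are left untouched, so plain kana tokens pass through as they are.
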
What next?

API. More and more people have been asking us if we provide an API. We currently don't, but we definitely want to provide one someday. I can't say when yet, and I don't want to make promises, but I'll be posting progress updates as they happen.

Copyright. More copyright issues have been raised lately. So I'll be writing a post about it, to try to explain clearly the issues we are facing related to copyright and what you can do to help.

Tuesday, August 3, 2010

Submission policy - What kind of content do we want?

This article explains what kind of content we accept in Tatoeba, what kind of content we delete and what kind of content we review. Note that this article is not final. You have the right to object to something or to ask for more clarifications.


What do we accept?

Tatoeba is about collecting sentences so we only want sentences. However, what exactly do we mean by "sentences"? What is a sentence and what is not? It's actually a difficult question... No one will doubt that "I am happy" is a sentence. But what about "On the left", is that a sentence? What about "Thank you", "Yes", or "Awesome"?

As far as I'm concerned, I think Tatoeba can handle a loose definition of "sentence". We don't strictly need to have an entity with at least a verb. To me, when spoken, everything is a sentence. When written, the main difference between a sentence and a non-sentence is punctuation. That's all. For the rest, as long as people can imagine a context where the "sentence" can be expressed, then it's a sentence.
So yes, I'm roughly saying that you can take all the words in the dictionary, add punctuation and perhaps a capital letter, you'd turn it into a sentence. I don't encourage it because it's not useful (dictionaries do that already), but one-word sentences are still tolerated. I'll trust people's common sense for adding only one-word sentences that are significant (for instance, "Hello" is, "House" isn't).

In case you run across sentences that are not strictly speaking sentences, then tag them as "non-sentence", so that there is a way to quickly identify them. Inform the owner about this article if he's a new member, and let him know it's better to have sentences with more context.
At any rate, don't bother starting endless discussions if the sentence has already been translated because it will be kept as is. Feel free however to add a new sentence based on the "non-sentence".

Generally speaking, Tatoeba is open to many kinds of sentences. We tolerate casual speech, slang, insults (as long as they are not targeting anyone in particular), erotic sentences, sentences that are not "true" (after all, Tatoeba is not an encyclopedia). These sentences can be tagged accordingly to inform users. But I'll ask people to focus primarily on appropriate and politically correct sentences. We don't have (yet) a good system to filter out sentences that are not very "safe", so don't flood us with those, please.


What do we delete?

What we delete for sure are:
  • Entries that people add by mistake due to our failure to provide a more efficient interface.
  • Sentences that owners themselves requested to delete (because the delete feature is still not available to everyone).
  • Entries that are copyrighted or under a license that is not compatible with CC-BY.
  • Racist comments and personal attacks, if they are really harmful and there is a general agreement that it should be removed.
  • Entries that really make no sense and whose owner won't provide any explanation.
With a view to providing better content, I'm also allowing the deletion of "sentences" that are "not really sentences" and came from the Tanaka Corpus, but only under these conditions:
  • The vocabulary is already illustrated in other sentences.
  • There is only the Japanese-English pair, no translation into any other language. We can make an exception for French (i.e. it's still deletable if there is a French translation).
  • All the sentences that will be deleted do NOT belong to anyone.
It may be obvious, but you should avoid translating a sentence that is likely to be deleted... Unless you want to stand against its deletion.


What do we review?

By "reviewing" I mean correcting mistakes. So we correct spelling mistakes, grammar mistakes, bad formulations, etc. We want Tatoeba's data to be used (or at least usable) for educational purposes so we want good quality sentences.

However, the limit between a "correct" and "incorrect" sentence is not always clear and some sentences can generate a lot of debate. In such cases, the final decision belongs to the owner of the sentence.

Remember that Tatoeba allows several translations in the same language, so there is no point fighting endlessly over what is correct or not. Simply add another version of the sentence if you are not happy with the existing one; we don't mind at all having near-duplicate sentences (cf. this discussion on the Wall, and more precisely my thoughts on the issue here).

We also don't want any kind of annotations in the sentences. You can find more details in the contributor's guide, rule #9. If you have a good reason to keep your annotations, then please explain it in your comments. Otherwise moderators have the right to edit your sentence two weeks after you have been requested to change your sentence.


What do we link?

Tatoeba's sentences are represented as a graph. Two sentences that are linked together have the same meaning. Linking two sentences in the same language is accepted, but you shouldn't link only based on meaning. The sentences that you link should also have an equivalent "style" and type of speech. Cf. my wall post here.
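
To make the graph idea a bit more concrete, here is a small Python sketch of sentences as nodes and links as undirected edges. Everything in it is an assumption made for illustration (the class name `SentenceGraph`, the use of integer sentence ids, the method names); it does not describe Tatoeba's actual database or code.

```python
from collections import defaultdict


class SentenceGraph:
    """Hypothetical undirected graph: nodes are sentence ids, edges are links."""

    def __init__(self):
        self._links = defaultdict(set)

    def link(self, a, b):
        """Record that sentences a and b have the same meaning."""
        if a == b:
            raise ValueError("a sentence cannot be linked to itself")
        # Links are symmetric, so store the edge in both directions.
        self._links[a].add(b)
        self._links[b].add(a)

    def unlink(self, a, b):
        """Remove the link between a and b if it exists."""
        self._links[a].discard(b)
        self._links[b].discard(a)

    def linked_to(self, a):
        """Return the ids of the sentences directly linked to a."""
        return sorted(self._links.get(a, ()))


if __name__ == "__main__":
    graph = SentenceGraph()
    graph.link(1, 2)  # e.g. an English sentence and its French translation
    graph.link(1, 3)  # the same English sentence and a Japanese translation
    print(graph.linked_to(1))
```

Note that in this sketch sentences 2 and 3 are not linked to each other directly; they are only reachable from one another through sentence 1.
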

NOTE: Only trusted users can link sentences.

Saturday, July 17, 2010

Tatoeba update (Jul 17th, 2010)

First of all, I'd like to mention that we've had a lot of traffic lately. Allan published an article on linuxfr.org about Tatoeba, and it sure brought a lot of new people :D
Google Analytics says 1,172 unique visitors on July 17th, while we usually have around 400-450. We're glad to see the server is still doing well despite the quite significant increase of activity!


What's new

We can now import sentences. Since July 4th actually, but I didn't have much time to write about it. The feature is currently only available to moderators, because we cannot safely let everyone import huge amounts of data. So the way it works is that you send us your sentences in a simple text file, by email (team@tatoeba.fr), and we import it.

We accept two formats:
  1. Single sentences: each line has one sentence. All the sentences have to be in the same language.
  2. Sentences + translations: each line has a sentence and its translation, separated by a tab (sentence [tab] translation). All the sentences have to be in the same language, and all the translations in the same language. For instance only French-Spanish, and not French-Spanish in one line, and Swedish-Spanish the next line.
IMPORTANT: We release our data under the Creative Commons Attribution (CC-BY) license. We will not be importing your content if it brings up copyright issues or license incompatibilities. For instance, don't send us sentences taken from textbooks, or sentences that are under the CC-BY-SA license (it's not compatible with CC-BY).
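
If you want to check your file before sending it, the two formats above can be sketched in a few lines of Python. This is only a rough illustration under my own assumptions: the function name `parse_import_file` is hypothetical, blank lines are skipped, and a file that mixes both formats is rejected. It is not the importer that moderators actually use, and it cannot check languages or licenses for you.

```python
def parse_import_file(text):
    """Parse file contents into a list of (sentence, translation) tuples.

    The translation is None when the file uses the single-sentence format.
    """
    rows = []
    for line_number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines are simply ignored
        parts = [part.strip() for part in line.split("\t")]
        if len(parts) == 1:
            rows.append((parts[0], None))
        elif len(parts) == 2:
            rows.append((parts[0], parts[1]))
        else:
            raise ValueError(f"line {line_number}: expected at most one tab")

    # A file must use one format throughout: all singles or all pairs.
    formats = {translation is None for _, translation in rows}
    if len(formats) > 1:
        raise ValueError("file mixes single sentences and sentence pairs")
    return rows


if __name__ == "__main__":
    print(parse_import_file("Bonjour.\tHola.\nMerci.\tGracias.\n"))
```
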

So far we imported:
  • ~700 pairs of sentences in Chinese-Shanghainese. In total we have ~900 pairs of sentences thanks to shanghaining.com. The first 200 ones were added by hand.
  • 200+ proverbs in Dutch.
  • 250+ proverbs in Ukrainian.
That's the major thing for the last couple of weeks.


What next?

We still have to import 2500+ pairs of English-Spanish sentences, provided by one of our registered users, Łukasz. And probably thousands and thousands of other sentences, as more and more people discover Tatoeba, and have their own private (or not so private) collections of sentences to share with everyone :)

In terms of features, there will not be much going on in the next couple of weeks. Actually it will depend on the rest of the team, but as far as I'm concerned, I will have other priorities.

There are still a lot of things that can be improved about the current features, and we will keep improving them, but in August we will also start discussing the next new stuff. I will write more about it when we get there.

Right now I'd just like to say thank you to everyone who gave this project a little bit - or a lot - of their time, of their knowledge, of their encouragement... Because Tatoeba has become an awesome place for language lovers and learners, and for that, the credit really goes to the community :)

Sunday, June 27, 2010

Tatoeba update (Jun 27th, 2010)

What's new
  • Page that lists all the tags. NOTE: It's not organized at all, it's really just for the sake of having a page that displays all the existing tags.
  • Page that lists all the sentences in a specific language, with possibility to show only those that are NOT translated yet into a certain language. For instance Japanese sentences not yet translated into English. Useful feature for contributors =)
  • Possibility to filter by language, on the page that lists sentences with a certain tag.
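
As an illustration of the "not translated yet" filter described above, here is a hedged Python sketch. The data layout is invented for the example (a dict of sentence id to a language code and text, plus a list of linked id pairs), as is the function name `untranslated`; the real page is backed by Tatoeba's own database, which may work quite differently.

```python
def untranslated(sentences, links, source_lang, target_lang):
    """Return ids of source_lang sentences with no direct target_lang translation.

    sentences: dict mapping id -> (language code, text)   (hypothetical layout)
    links: iterable of (id, id) pairs of linked sentences (hypothetical layout)
    """
    neighbours = {}
    for a, b in links:
        # Links work in both directions.
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    result = []
    for sentence_id, (lang, _text) in sentences.items():
        if lang != source_lang:
            continue
        linked = neighbours.get(sentence_id, ())
        if not any(sentences[n][0] == target_lang for n in linked):
            result.append(sentence_id)
    return result


if __name__ == "__main__":
    sentences = {
        1: ("jpn", "ありがとう。"),
        2: ("eng", "Thank you."),
        3: ("jpn", "はい。"),
        4: ("fra", "Oui."),
    }
    links = [(1, 2), (3, 4)]
    # Japanese sentences not yet translated into English.
    print(untranslated(sentences, links, "jpn", "eng"))
```
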
What's next
  • Possibility to import sentences from a CSV file. This feature won't be available to normal users. For a start (and I think for a long time), only moderators will have access to it. So anyone who wants to import sentences from a file will have to make a request. Anyway, the main point is that as soon as we have this feature, we will add lots of new sentences in bulk =]

Friday, June 11, 2010

Tatoeba update (June 12th, 2010)

What's new

I am glad to announce that we are finally introducing... tags!! :D

This will provide a way for people to add meta-data to sentences. For instance "proverb", "formal", "informal", "male", "female", etc. Such information can be very useful for language learners because they cannot necessarily guess such things just by reading the sentence.

Tags will be restricted for a short period of time. Only trusted users will be able to add tags, but everyone can see the tags associated with a sentence. When we feel the feature is ready for everyone, we will allow everyone to add tags.

People will be free to tag sentences with whatever they want. We don't really have any strict rules yet because tags are still new, and we want to see how people use them. But I can at least suggest some basic tags:
  • proverb, archaic, slang
  • formal, informal
  • male, female (to indicate whether the sentence is said by a man or a woman)
  • to delete, to correct, checked (I will talk more about these)
  • controversial, unsafe (to mark sentences that can cause problems, are not suitable for kids, etc).
  • easy, intermediate, difficult (to indicate the level of difficulty of a sentence)
So these are only my suggestions. Again, the tag feature is new, so we will necessarily go through a phase of experimentation before we can clearly set any rule. We count on everyone to try and help us figure out what works best. Feel free to discuss issues related to tags on the Wall.

A few more things you need to know about tags:
  • You can see the list of sentences associated with a certain tag by clicking on the tag.
  • You can remove a tag from a sentence only if you were the one who added it.
  • Moderators can remove any tag.
  • It's not possible to add the same tag twice to a sentence. If someone has already added "proverb", you can't re-add "proverb".
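
The rules above about adding and removing tags can be summed up in a short Python sketch. It is only a model for illustration: the class name `SentenceTags`, its methods and the `is_moderator` flag are all hypothetical and say nothing about how Tatoeba actually stores tags.

```python
class SentenceTags:
    """Hypothetical model of the tags attached to a single sentence."""

    def __init__(self):
        # Maps each tag to the username of the person who added it.
        self._added_by = {}

    def add(self, tag, user):
        """Add a tag; return False if the sentence already has it."""
        if tag in self._added_by:
            return False
        self._added_by[tag] = user
        return True

    def remove(self, tag, user, is_moderator=False):
        """Remove a tag; only its author or a moderator may do so."""
        author = self._added_by.get(tag)
        if author is None:
            return False
        if user != author and not is_moderator:
            return False
        del self._added_by[tag]
        return True

    def tags(self):
        """Return the current tags in alphabetical order."""
        return sorted(self._added_by)


if __name__ == "__main__":
    tags = SentenceTags()
    tags.add("proverb", "alice")
    print(tags.add("proverb", "bob"))     # duplicate, so it is refused
    print(tags.remove("proverb", "bob"))  # bob did not add it and is no moderator
```
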

"to delete" tag

Those tags will help moderators in their work. At the moment, in Tatoeba, only moderators can delete sentences. The traditional way of requesting a deletion was to add a comment to it, and point out that it should be deleted (and explain why). But the flow of comments has increased a lot and it's less easy for moderators to keep track.

So if you come upon a sentence that you feel should be deleted, then tag it with "to delete" so that moderators can easily find it and clean Tatoeba of entries that are not valid. Anything that is gibberish is not valid. Anything that is not a complete sentence is not valid. But then again, we haven't decided what exactly is a "sentence" so it's debatable.


"to correct" tag

In Tatoeba, it is not possible to modify a sentence that doesn't "belong" to you. The sentences that belong to you are typically sentences that you have added yourself. No one (or almost) can touch them besides you. If someone sees a mistake in your sentence, all they can do is post a comment, and you have to correct it.

But certain members contribute sentences with mistakes and never come back. And for now, no one can correct their mistakes... except moderators. So if you want to help moderators, you can tag a sentence with "to correct" whenever it needs to be corrected, has a comment asking for the correction, and still has not been corrected two weeks later.


"checked" tag

Before I explain further, I must stress that this tag is experimental. Many times people have asked for a way to tell whether a sentence can be trusted or not. Okay, so now we can tag a sentence as "checked" to indicate that it has been proofread and validated as a correct sentence.

Of course, this raises some problems...
  • What if a user tags a sentence as "checked" just for the fun of it?
  • What if a user tags a sentence as "checked" but was tired and overlooked a mistake?
Well, we can't guarantee 100% accuracy. A sentence that is tagged "checked" will simply have a higher reliability rate than one that isn't, but it won't be 100% (no one can guarantee that anyway).


What's next
  • We will make tags available to everyone.
  • We will add a page that lists all tags, to enable people to easily browse by tags.
  • We will provide a way to merge tags.
  • And many other things, but I will talk about it when the time comes.
In the meantime, enjoy :)

Sunday, May 30, 2010

Tatoeba update (May 30th, 2010)

What's new

This is a small update.
  • We simplified the registration process. If it doesn't bring too much spam, we'll leave it like that.
  • We started reviewing the texts in Tatoeba. There's still a lot of editorial work to do though.
  • We added support for right-to-left languages (like Arabic). They are now actually displayed right to left.

What's next
  • Import lists from CSV file (I have already mentioned this many times).
  • We will try as well to implement tags for sentences.
But you have to know that we are currently investing more time in promoting the project. That means less time on implementing new features.

The reason is that we registered with Drumbeat last weekend. Drumbeat is a platform launched by Mozilla earlier this year. It was made for people to promote projects that can make the Web better and keep it open. The best projects can even get seed funding, and we kind of hope we will :)

Monday, May 24, 2010

Moderators in Tatoeba


This is a little guide/FAQ to explain what the role of a moderator is in Tatoeba, and to make sure moderators use their powers wisely.


Why do we need moderators?

Every community needs its moderators, but in Tatoeba more specifically, the problem is that unless you are the admin, you (currently) cannot:
  • delete sentences, not even your own sentences
  • edit sentences that do not belong to you
So with the growing community, more and more sentences are getting in the "delete me" and "correct me" queues (due to members who never come back to correct their sentences).

Moderators are here to help take care of these sentences that no one else can take care of.


What can moderators do?

Moderators can currently delete, edit, and link/unlink any sentence. Yes, this is a lot of power, but since contributions are logged and can be seen by everyone, we don't need to worry too much about a moderator going nuts and ruining others' work.

Keep in mind that the moderator's rights are not "stable" yet. We will balance out the permissions over time. For now, we don't really have time, so we'll trust moderators to do the right thing.


When should moderators edit or delete?

Only use your moderator rights as the last resort.

This is especially true when dealing with others' sentences. Some people will gladly let you edit or delete their sentences without having to be notified about it (they may even be annoyed by the notifications). But other people may feel that you are abusing your powers, not respecting their work, not acknowledging their presence in the project, or whatever.

To avoid any kind of conflict, only edit sentences where the latest correction request says "two weeks ago" (or more) and no correction has been made. Only delete a sentence after asking the owner if they're okay with their contribution being deleted.

Basically, give people the time to do their work first, and only if they don't do anything, you can step in.


How do you become a moderator?

You can either ask Trang or wait for her to notice that you are a good candidate to be a moderator. The criterion is that you are at least already a "trusted user". The rest is subjective.