Sunday, January 9, 2011

Tatoeba day #2 (Jan 23rd, 2011)

We have decided on a date for our next Tatoeba day: January 23rd, 2011. Just like the last Tatoeba day, it will start at 0:00 and end at 23:59 (France time).

There will be 2 objectives for that day:
  • Banners for Tatoeba
  • Quality of the corpus

Banners for Tatoeba

For those who are not sure what I'm talking about, what we call a "banner" is basically an image that represents a website. Let's say you are a fan of Tatoeba and have a personal blog. Because you are very supportive, you would like to put a link to Tatoeba on your blog, and perhaps you would like the link to be graphical rather than simple text. Well, we don't really have any standard image for people to use in such situations and we'd like to create some.

So we're organizing a little contest for our contributors with artistic/design skills (or who just want to give it a try): create banners for Tatoeba!

Everyone can participate. If you want to, here's what to do and what to know:
  1. Make 2 images with the following sizes: 88x31 and 392x72. You may re-use our current logo in it, but don't hesitate to make another (better) logo if you are inspired.
  2. Send your 2 images to trang@tatoeba.fr with the title "Tatoeba banners" and indicate your Tatoeba username in the email. I will reply to confirm that I have received them.
  3. The current deadline is January 23rd, 13:00 (France time). However, we will need at the very least 5 submissions, otherwise it's not very interesting :P I do hope there will be more than 5, but if there haven't been enough submissions, we will extend the deadline to the next Tatoeba day: February 20th, same time.
  4. Shortly after the deadline I will publish the banners that were sent to me. Then Tatoeba users will have one week to vote for their favorite banners. I'm not sure yet how we will do the votes but I will write about it in due time. IMPORTANT: I don't want people to be influenced by "who made the banner" during the vote so I will not indicate this information when I first publish the banners. I will ask you as well to keep your work "secret" the whole time (don't show it to anyone and don't say "I did this one").
  5. Once the votes are over, I will reveal the participants and announce the winner who will then be venerated forever by everyone for his/her talent :)


Quality of the corpus

Since Tatoeba is open for everyone to contribute, one of its biggest problems is quality. Contributors aren't necessarily professionals and we inevitably have many sentences that contain mistakes or don't sound right. For our 2nd Tatoeba day, we will be focusing on quality. The goal of the day will be to check, correct and improve as many sentences as possible.

We've got plenty of sentences tagged "@Needs Native Check", "@change" and "@check", and it would be really nice to remove as many of these tags as possible and replace them with the 'OK' tag. We've also got plenty of orphan sentences that desperately need parents.

If you want to participate, don't be shy and join our IRC channel #tatoeba on January 23rd (see our help page to learn how to use IRC, in case you are not familiar with it). This way you can discuss in real time with other members what to do with a sentence (among other things)!

The next day, I will be publishing the following stats:
  • The number of sentences modified.
  • The number of comments posted.
  • The number of sentences tagged 'OK'.
  • The approximate number of sentences adopted.
I'll be honest though, things might be a little disorganized at first. I don't know yet how many people intend to participate and I don't know yet how we will coordinate with each other to work efficiently together. But this second Tatoeba day will be the occasion to experiment and hopefully figure something out :)

NOTE: You may want to read these articles to learn a bit more about how we handle quality, even though they are not up-to-date anymore.

Sunday, December 19, 2010

Projects using Tatoeba

I just received an email asking if we had a page listing projects using Tatoeba and was reminded that we still don't. So here is the beginning of a list. Feel free to send us projects that you know of (or are the author of) and that are not listed here!

Websites
Mobile apps
Microblogging

Friday, December 10, 2010

Tatoeba update (Dec 10th, 2010)

What's new

Sentences stats. There's now a specific page for the sentences stats, to make them a bit more readable. The total number of sentences is also now indicated (it's quite an important number, but for some reason we never displayed it anywhere).

Wall messages of a user. You can browse the messages that were posted by a specific user, from the user's profile. Click on "See this user's contribution" and scroll to the bottom of the page. You will see the latest messages posted by the user, and a link to view them all (if the user has posted any message).

Sentences tagged more than 2 weeks ago. That's useful mostly for moderators :)

New languages. We've added Ainu, Malayalam, Low German and Sicilian.

FAQ. In case you haven't noticed, the procedure to request a new language was updated (several weeks ago), and we added a new question, regarding audio.


What next
  • Improvement of the profile page. The way you edit your profile at the moment is neither very intuitive nor very practical.
  • Very certainly other things but I can't tell what yet because it will depend on my inspiration...

Sunday, November 21, 2010

Tatoeba update (Nov 21st, 2010)

Alright, it's been a long time since we last updated Tatoeba :) This is just a small update.

What's new

"Members" page. This is probably the main modification. We redesigned the "Members" page a little bit so that it looks better and loads faster. We removed the information about the last login, because some people don't like being spied on :P We removed the top 20 ranking because that's what made the page so slow. Instead we're displaying the members who are currently active (those who took part in the last few hundred contributions).

Tags info. If you hover your mouse over a tag, you will see the id of the user who added it, and the date when it was added. This is mostly useful for sentence owners, who may wonder why someone has tagged a sentence a certain way. You can figure out who's the user behind a certain id with the following URL: http://tatoeba.org/users/show/[id].
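For those who like to script things, here is a minimal sketch of how the URL pattern above could be filled in. The function name `user_profile_url` and the id 42 are made up for illustration, and the check that the id is a positive integer is only an assumption, not something Tatoeba documents.

```python
def user_profile_url(user_id: int) -> str:
    """Fill the [id] placeholder of the URL pattern quoted above."""
    # Assumption: user ids are positive integers.
    if user_id <= 0:
        raise ValueError("user id must be a positive integer")
    return f"http://tatoeba.org/users/show/{user_id}"


# 42 is a hypothetical id, used only as an example.
print(user_profile_url(42))
```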

Set language to "unknown". We get requests for new languages quite frequently and we ask people to add a few sentences in the language they request. Except that the language is sometimes misdetected and there was no way to set the language to "unknown" (to indicate that it's a language that is not in the list). Now it's possible. There is an option called "other language", which will set the language icon to "unknown".

Sentence owner's name in comments. It was requested a long time ago, and it's finally here. The name of the sentence owner is now indicated in the comments, next to the sentence itself. This way, when you look at a comment on the homepage, you will not only know what sentence it is associated with, but also the user who added that sentence.


What next
  • We'll be working on a page that lists all sentences that were tagged @change and @delete more than 2 weeks ago. This way moderators will have a simple way to know what sentences they can/should take care of.
  • We'll be adding a page that lists all the Wall messages of a user.
  • And perhaps other random things...

Sunday, November 14, 2010

Tatoeba day & stats

Yesterday was our first Tatoeba day, so today I'm publishing stats about what was achieved that day, as well as more general stats.


Stats by language

The chart below shows the number of sentences added on Nov 13th for each language.


The gold medal goes to Arabic! Silver goes to Esperanto and bronze goes to German :)
  1. Arabic (573)
  2. Esperanto (354)
  3. German (247)
  4. Egyptian Arabic (230)
  5. Spanish (207)
  6. Italian (183)
  7. Chinese Mandarin (162)
  8. Hebrew (125)
  9. French (113)
  10. Ukrainian (105)
  11. Danish (100)
  12. Hungarian (78)
  13. Cantonese (78)
  14. English (73)
  15. Russian (70)
  16. Polish (45)
  17. Dutch (36)
  18. Old East Slavic (33)
  19. Lithuanian (18)
  20. Persian (17)
  21. Unknown language (10)
  22. Portuguese (8)
  23. Finnish (7)
  24. Latvian (4)
  25. Vietnamese (4)
  26. Czech (3)
  27. Swedish (3)
  28. Norwegian Bokmål (2)
  29. Shanghainese (2)
  30. Breton (1)
  31. Bulgarian (1)
  32. Catalan (1)
  33. Estonian (1)
  34. Japanese (1)
  35. Quechua (1)
  36. Slovak (1)
  37. Turkish (1)
  38. Uzbek (1)
Sadly, the record of 3465 sentences added, set on August 18th, was not broken. We only made it to 2899. It's still not bad though, since it's the second-best day in terms of sentences added (and by "sentences added" I mean "new sentences + translations").

We were missing a few of our devoted members that day, so I guess it's normal. Let's hope more people will be available for the next Tatoeba day :)


Stats by users

The chart below shows the number of sentences added (in green) and the number of sentences modified (in yellow) on Nov 13th, for the top 20 users. You'll excuse my laziness but I only used the number of sentences added for the rank.

Saeb wins the day, by far, with 802 sentences added! Congrats :D Second place goes to nickyeow, and third place goes to Eldad.

At any rate, everyone deserves a big thank you for their contributions! THANK YOU :)

  1. saeb (802/20)
  2. nickyeow (214/20)
  3. Eldad (166/17)
  4. aandrusiak (140/7)
  5. MUIRIEL (138/41)
  6. Guybrush88 (135/2)
  7. danepo (100/12)
  8. GrizaLeono (94/21)
  9. Shishir (94/12)
  10. Dejo (56/11)
  11. Archibald (54/32)
  12. darinmex (53/5)
  13. rado (52/2)
  14. Leono (51/10)
  15. esocom (51/4)
  16. Esperantostern (48/5)
  17. Muelisto (43/1)
  18. kroko (42/4)
  19. Dorenda (41/0)
  20. qdii (40/11)
  21. zipangu (37/2)
  22. wondersz1 (33/4)
  23. Manfredo (27/1)
  24. samueldora (24/2)
  25. sysko (23/7)
  26. szaby78 (22/5)
  27. Zifre (22/7)
  28. cost (21/2)
  29. sencay (20/2)
  30. shanghainese (19/0)
  31. fanty (18/0)
  32. pliiganto (16/13)
  33. BraveSentry (15/1)
  34. pjer (14/5)
  35. U2FS (14/3)
  36. debian2007 (13/1)
  37. Gyuri (12/3)
  38. jxan (12/0)
  39. virgil (12/4)
  40. TRANG (11/32)
  41. slavneui (11/0)
  42. sarah (11/0)
  43. kebukebu (10/2)
  44. Wimmer (10/1)
  45. ae5s (10/0)
  46. Tonari (9/0)
  47. arashi_29 (9/5)
  48. Aleksej (7/0)
  49. CK (5/14)
  50. Shoyren (4/1)
  51. Holyspirit (3/0)
  52. JimBreen (2/0)
  53. luwenzhuo (2/0)
  54. CLARET (2/1)
  55. lajauge (1/0)
  56. ozma29 (1/0)
  57. sschlumberger (1/0)
  58. mr5 (1/0)
  59. Tenshi (1/0)
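The ranking above only uses the number of sentences added. As a rough sketch of that ordering (not the script actually used for the chart), here is how it could be reproduced with a handful of rows copied from the list. The names `stats` and `rank_by_added` are made up for illustration.

```python
# (username, sentences added, sentences modified), copied from the list above.
stats = [
    ("GrizaLeono", 94, 21),
    ("saeb", 802, 20),
    ("Eldad", 166, 17),
    ("nickyeow", 214, 20),
    ("Shishir", 94, 12),
]


def rank_by_added(rows):
    """Order users by sentences added, highest first."""
    # sorted() is stable, so users with the same "added" count
    # (GrizaLeono and Shishir here) keep their input order.
    return sorted(rows, key=lambda row: row[1], reverse=True)


for position, (name, added, modified) in enumerate(rank_by_added(stats), start=1):
    print(f"{position}. {name} ({added}/{modified})")
```

Ties are not broken by the number of sentences modified here; that would be a different ranking from the one in this post.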

Language ranks

Tatoeba day is a good occasion to see how each language has progressed. You can see how each language with more than 1000 sentences was positioned one month ago, in this previous post. Let's see how it is now...

Top 5

The top 5 hasn't changed.
  1. English - 158,000+. It looks like English has been growing a little bit.
  2. Japanese - 153,000+. Japanese is standing still. You can tell we don't have a very strong Japanese community.
  3. French - 53,000+. French keeps moving at a steady pace.
  4. Esperanto - 47,000+. Esperanto is catching up with French quickly...
  5. German - 32,000+. German is progressing better than French, but still not quite as well as Esperanto.
Other languages with 10,000+ sentences
  • Polish - 20,000+
  • Spanish - almost 19,000. Spanish gained one rank! :D
  • Russian - almost 18,000
  • Chinese Mandarin - almost 15,000
  • Ukrainian - 14,000+
Other languages with 1,000+ sentences
  • Italian - 8,500+
  • Arabic - 6,500+. Great boost for Arabic!
  • Dutch - almost 6,500
  • Portuguese - 6,000+
  • Hebrew - 4,500+. Great boost for Hebrew as well!
  • Icelandic - 4,000+
  • Hindi - almost 3,500
  • Hungarian - 3,000+. Hungarian joined the 1,000+ sentences club! Very good progress.
  • Turkish - 2,500+
  • Shanghainese - 2,500+
  • Uyghur - almost 2,500
  • Danish - 2,000+. Danish is new to the club with very good progress as well!
  • Vietnamese - 2,000+
  • Belarusian - almost 2,000
  • Norwegian Bokmål - 1,500+
  • Cantonese - 1,500+

Other numbers
  • 55,735 sentences added in October.
  • About 25,000 sentences added since the beginning of November.
  • We've reached 600,000 sentences in total today!
  • But there are probably thousands of duplicates, so it's not really 600,000 yet...
  • We will soon have 76 languages. 5 are waiting to be added: Galician, Irish, Interlingua, Lojban, Toki Pona. Note that the last 3 languages are constructed languages.

Next Tatoeba day

A potential date for the next Tatoeba day would be December 11th, although it could be December 18th as well. We'll see what suits everyone best.

The main objective of the first Tatoeba day was to break the record of the highest number of sentences added in one day. We didn't break it, but it's okay because we still had fun :D

The main objective will be different for the second Tatoeba day. We haven't decided what it will be yet, but I think it would be nice to emphasize adoption next time, because unfortunately I didn't really have time to look at adoptions for this first Tatoeba day :(

Anyway, we'll keep you informed. Thanks again to everyone who participated and who came to our IRC channel :)

Sunday, November 7, 2010

Tags guidelines

We introduced the "tags" feature several months ago and we've let trusted users experiment with it pretty much freely. A profusion of tags has been created, but they are quite a mess, so we decided to try tidying up.

From now on, if you are going to tag a sentence, please take into consideration the following things.


1. Use tags for objective and official information

We would like to keep the tags for "objective" and "official" information. If you want to categorize sentences for personal purposes, you should use lists.

For instance, you cannot tag a sentence "French exam" to mark the sentence as part of those you will use to practice before your French exam; you should create a list for that. We know lists are not as practical as tags, but we'll be improving the lists feature as soon as we have time.


2. Avoid creating new tags

Avoid creating new tags because it can make the cleaning process harder. If the tag you want to add doesn't appear in the autocompletion list, then it's a new tag, so don't add it unless you are really convinced it's a valid tag.


3. Ask before you create a new tag

We don't have clear rules yet for what is a valid tag and what is not, but one of our moderators (Swift) volunteered to take care of the tags. If you feel the need to create a new tag, it would be wise to ask Swift first. He will be officially in charge of tidying up the tags. He will be the one deciding which tags to keep and which to rename. Also, don't hesitate to contact him if you would like to help out. It's not easy to decide on these things.


4. Use English for tags, unless you really can't

We have decided to use English as the default language for tags. We will rename all non-English tags into their English equivalent, when it is possible. We can still accept non-English tags, but only if there is no English equivalent.

The point of having one common language is uniformity. It would be inefficient to have a bunch of sentences tagged "proverb" (English) and another bunch tagged "proverbe" (French). There is also no point having a sentence tagged with both "proverb" and "proverbe". They are the same notion. It can even make things confusing to have several tags designating the same notion. That's why we have decided to have one default language. We will later implement the possibility to translate the tags and to display them in languages other than English.
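To make the idea concrete, here is a small sketch of what renaming non-English tags to their English equivalent amounts to. It is only an illustration, not how Tatoeba stores or renames tags. The mapping `ENGLISH_EQUIVALENTS` and the function `normalize_tags` are hypothetical, and the only entry in the mapping is the "proverbe" example from above.

```python
# Hypothetical mapping from non-English tags to their English equivalents.
ENGLISH_EQUIVALENTS = {"proverbe": "proverb"}


def normalize_tags(tags):
    """Rename known non-English tags and drop the resulting duplicates."""
    normalized = []
    for tag in tags:
        # Tags without a known English equivalent are kept as they are.
        english = ENGLISH_EQUIVALENTS.get(tag, tag)
        if english not in normalized:
            normalized.append(english)
    return normalized


# A sentence tagged with both versions ends up with a single tag.
print(normalize_tags(["proverb", "proverbe"]))
```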


5. How things are going to work
  • We'll try to keep the process as transparent as possible.
  • Swift will publish on the Wall the modifications that will be applied to the tags (i.e. renaming and deletions).
  • There will be a few days until these modifications are actually applied, in case people strongly disagree with a decision.
  • Swift will also add on his profile and his personal web page the links to every Wall post mentioning the modifications, for people to be able to trace back all the decisions about the tags.
  • If you need to protest against a decision, please refer to Swift.

Tatoeba day

Tatoeba has seen its community grow quite significantly in the past 6 months, and it's really encouraging. There was a suggestion about having a "Tatoeba day", a day where (passionate) members would try to contribute more passionately than ever. It's a very good idea so we'll be organizing one every month (we'll try to).

When?
The first one will happen on Saturday November 13th, from 0:00 to 23:59 (France time).

Where?
Well, this is a virtual event, so it happens on the internet... BUT if you want to live this event to the fullest, come to our IRC channel on Nov 13th: #tatoeba, on freenode. Don't be shy! And even if you are shy, you can just drop by to read what's going on.

What?
For the first Tatoeba day, we will start with something very basic. The goal of the day will be to translate, correct and adopt a lot of sentences. Not that it's different from what's already happening every day, but I will publish detailed stats the following day, to give an idea of what has been achieved during those 24 hours.
  • How many sentences added for each language and each user
  • How many corrections made for each language and each user
  • How many sentences adopted for each language and each user

Why?
This event is of course an occasion to be more productive than we usually are, but it's mostly an occasion for members to feel more connected with each other and to have fun! You may also learn a few things about Tatoeba that you didn't know :)