Tuesday, January 25, 2011

Sentences stats (Jan 2011)

Alright, so I've published the stats for Tatoeba day #2, and it's also a good occasion to publish more general stats about how the corpus is progressing.

Languages ranking

Top 5
  • English - 167,000+. That's about 10,000 more than two months ago. CK has added about 2500 sentences from Voice of America.
  • Japanese - 153,000+. Hasn't progressed much ^^
  • Esperanto - ~70,000. The stats currently show over 70,000, but more than 2000 of those are duplicates, so it's not quite 70,000 yet. However, Esperanto is now the 3rd largest language in Tatoeba! Incredible achievement :)
  • French - 57,000. 4000 new sentences compared to 2 months ago.
  • German - 43,000. 11,000 new sentences compared to 2 months ago. At this rate it will not take long before German outranks French as well :P

Other languages with 10,000+ sentences
  • Spanish - 25,000+. 6000 new sentences compared to 2 months ago, and gained one rank :D
  • Polish - 24,000. Lost its rank to Spanish, but still gained 4000 sentences.
  • Russian - 22,000+. Also gained 4000 sentences, and is still at the same position.
  • Chinese Mandarin - 16,000+. Gained 1000 sentences, still at the same position.
  • Ukrainian - 15,000+. Gained 1000 sentences, also remained at the same position.
  • Italian - 14,000+. Remains at the same position but finally reached the 10,000 milestone :D And gained 5500 sentences since last time.
  • Dutch - 12,000+. Dutch also joined the 10,000 family! Gained one rank and pretty much doubled in quantity.
  • Hungarian - 10,000+. 3rd language to join this category! Very fast progression. Gained 7000 sentences and is ranked 12th while it was ranked 18th last time!

Other languages with 1,000+ sentences
  • Hebrew - 8,000+. Pretty much doubled in quantity as well!
  • Arabic - 7,500+. Has been slowing down. Only 1000 new sentences compared to last time.
  • Portuguese - 7,000.
  • Icelandic - 5,500+.
  • Persian - 5,000+. Persian is new here! Maybe we'll see it in the 10,000 category in a few months :)
  • Danish - 4,500+.
  • Hindi - 3,500.
  • Turkish - 3,300.
  • Uyghur - 3,000.
  • Shanghainese - 2,700.
  • Vietnamese - 2,600.
  • Belarusian - 2,000.
  • Cantonese - 1,700.
  • Norwegian (Bokmål) - 1,600.
  • Lojban - 1,100. Lojban is new here as well!
  • Swedish - 1,000. And Swedish too!

Other numbers
  • We've reached the 700,000 milestone this month! Although we have 6000+ duplicate sentences, so it's not really 700,000 yet.
  • We're currently supporting 83 languages.
  • We have 8000+ sentences with audio.
