Sunday, May 1, 2011

Languages stats and leaders

It's been a long time since I last published stats, hasn't it? But thanks to sysko, who rid us of the duplicates yesterday, I'm feeling a bit more comfortable talking about numbers.

Language ranking

I've decided to include the "leaders" for each language; it's an interesting piece of information. It should give a good idea of who the most influential members in Tatoeba currently are for each language.

NOTE 1: Not all leaders are necessarily authoritative references for the language they lead.
NOTE 2: The stats only list the languages that have more than 1000 sentences.

Meaning of the fields:
  • # → rank of the language.
  • code → ISO 639-3 code corresponding to the language.
  • language → name of the language (in English).
  • total → total number of sentences in the language.
  • leader → username of the member who owns the most sentences in the language.
  • owns → number of sentences owned by the user in the language.
  • %owned → percentage of sentences owned by the user in the language (%owned = owns / total).

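#  | code | language            | total  | leader         | owns  | %owned
1  | eng  | English             | 176232 | CK             | 54337 | 30,8%
2  | jpn  | Japanese            | 154779 | fcbond         | 1483  | 1,0%
3  | epo  | Esperanto           | 78593  | GrizaLeono     | 13694 | 17,4%
4  | fra  | French              | 68426  | sacredceltic   | 16450 | 24,0%
5  | deu  | German              | 58485  | MUIRIEL        | 12824 | 21,9%
6  | spa  | Spanish             | 37755  | Shishir        | 9138  | 24,2%
7  | pol  | Polish              | 30856  | zipangu        | 22699 | 73,6%
8  | cmn  | Mandarin Chinese    | 27869  | fucongcong     | 9022  | 32,4%
9  | rus  | Russian             | 26757  | Hellerick      | 6605  | 24,7%
10 | ita  | Italian             | 19329  | Guybrush88     | 8235  | 42,6%
11 | nld  | Dutch               | 18746  | martinod       | 8140  | 43,4%
12 | ukr  | Ukrainian           | 15826  | aandrusiak     | 4382  | 27,7%
13 | hun  | Hungarian           | 12656  | szaby78        | 3574  | 28,2%
14 | pes  | Persian             | 10280  | pliiganto      | 3816  | 37,1%
15 | heb  | Hebrew              | 10118  | Eldad          | 7448  | 73,6%
16 | por  | Portuguese          | 9628   | brauliobezerra | 5994  | 62,3%
17 | ara  | Arabic              | 7940   | saeb           | 5886  | 74,1%
18 | isl  | Icelandic           | 7721   | Swift          | 7472  | 96,8%
19 | tur  | Turkish             | 6434   | boracasli      | 3821  | 59,4%
20 | nds  | Low Saxon           | 5753   | slomox         | 5490  | 95,4%
21 | dan  | Danish              | 5032   | danepo         | 4628  | 92,0%
22 | bul  | Bulgarian           | 4602   | ednorog        | 4027  | 87,5%
23 | uig  | Uyghur              | 3747   | FeuDRenais     | 3263  | 87,1%
24 | hin  | Hindi               | 3468   | minshirui      | 3463  | 99,9%
25 | wuu  | Shanghainese        | 3257   | fucongcong     | 1612  | 49,5%
26 | vie  | Vietnamese          | 2987   | autuno         | 1876  | 62,8%
27 | bel  | Belarusian          | 2158   | Demetrius      | 2089  | 96,8%
28 | tlh  | Klingon             | 2104   | Vortarulo      | 2097  | 99,7%
29 | jbo  | Lojban              | 2017   | Zifre          | 1179  | 58,5%
30 | yue  | Cantonese           | 1930   | nickyeow       | 1767  | 91,6%
31 | nob  | Norwegian (Bokmål)  | 1872   | contour        | 1090  | 58,2%
32 | fin  | Finnish             | 1585   | Hautis         | 502   | 31,7%
33 | ina  | Interlingua         | 1537   | McDutchie      | 1494  | 97,2%
34 | swe  | Swedish             | 1190   | Don            | 719   | 60,4%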
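
For anyone who wants to double-check the %owned column, here is a small sketch in Python. The function name `percent_owned` is just an illustration and not something from the Tatoeba code base; the numbers are copied from three rows of the table above.

```python
def percent_owned(owns: int, total: int) -> float:
    """Share of a language's sentences owned by one member, in percent."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * owns / total


# (language, total, owns) taken from the English, Japanese and Icelandic rows
rows = [
    ("English", 176232, 54337),
    ("Japanese", 154779, 1483),
    ("Icelandic", 7721, 7472),
]

for language, total, owns in rows:
    # One decimal place, like the table (which uses a decimal comma instead)
    print(f"{language}: {percent_owned(owns, total):.1f}%")
```

Rounded to one decimal place, these agree with the 30,8%, 1,0% and 96,8% shown in the table.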


Progress

Let's see how the corpus has progressed since last time...
  • We've reached our 800,000+ milestone; we're now at 834,000+ sentences.
  • The top 5 is still the same.
  • Persian and Hebrew joined the 10,000+ family!
  • Low Saxon, Bulgarian, Klingon, Finnish and Interlingua joined the 1000+ family!
At this rate, we should reach our 1 million milestone some time around September :)

1 comment:

  1. I just thought of a statistic related to this information about the leaders. Take Icelandic, for example. That Swift character owns 96.8% of all the sentences! In a way, it might be seen as a sign of the homogeneity of the contributions in that language. A single contributor is more likely to provide a narrow cross-section of the language, so such a high share could be an indicator of a less representative set.

    It could thus be interesting to see how the languages rank with regard to how evenly the sentences in each language are distributed among its contributors. The Gini coefficient comes to mind as a candidate for this.
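
    To make the idea concrete, here is a rough sketch of how a Gini coefficient could be computed from per-contributor sentence counts. The post only gives the leader's count, so the breakdown among the other Icelandic contributors below is invented purely for illustration, as is the second, evenly spread list.

```python
def gini(counts):
    """Gini coefficient of non-negative counts.

    0 means the sentences are spread evenly among contributors;
    values close to 1 mean one contributor owns almost everything.
    """
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        raise ValueError("need at least one contributor with a non-zero count")
    # Formula for sorted data, with ranks i starting at 1:
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum(i * x for i, x in enumerate(values, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Icelandic: 7721 sentences, 7472 owned by the leader (from the table).
# The split of the remaining 249 sentences is HYPOTHETICAL.
concentrated = [7472, 100, 80, 69]

# HYPOTHETICAL language of the same size with evenly spread ownership.
spread_out = [2000, 2000, 1900, 1821]

print(f"concentrated: {gini(concentrated):.2f}")
print(f"spread out:   {gini(spread_out):.2f}")
```

    One thing to keep in mind: with n contributors the coefficient cannot exceed (n - 1) / n, so languages with very few contributors are not directly comparable to languages with many unless the value is normalized.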
