Language ranking
I've decided to include the "leaders" for each language; it's interesting information. It should give a good idea of who the most influential members of Tatoeba currently are for each language.
NOTE 1: Not all leaders are necessarily references in the language they lead.
NOTE 2: The stats only list the languages that have more than 1000 sentences.
Meaning of the fields:
- # → rank of the language.
- code → ISO 639-3 code corresponding to the language.
- language → name of the language (in English).
- total → total number of sentences in the language.
- leader → username of the member who owns the most sentences in the language.
- owns → number of sentences owned by the user in the language.
- %owned → percentage of sentences owned by the user in the language (%owned = owns / total).
# | code | language | total | leader | owns | %owned |
1 | eng | English | 176232 | CK | 54337 | 30.8% |
2 | jpn | Japanese | 154779 | fcbond | 1483 | 1.0% |
3 | epo | Esperanto | 78593 | GrizaLeono | 13694 | 17.4% |
4 | fra | French | 68426 | sacredceltic | 16450 | 24.0% |
5 | deu | German | 58485 | MUIRIEL | 12824 | 21.9% |
6 | spa | Spanish | 37755 | Shishir | 9138 | 24.2% |
7 | pol | Polish | 30856 | zipangu | 22699 | 73.6% |
8 | cmn | Mandarin Chinese | 27869 | fucongcong | 9022 | 32.4% |
9 | rus | Russian | 26757 | Hellerick | 6605 | 24.7% |
10 | ita | Italian | 19329 | Guybrush88 | 8235 | 42.6% |
11 | nld | Dutch | 18746 | martinod | 8140 | 43.4% |
12 | ukr | Ukrainian | 15826 | aandrusiak | 4382 | 27.7% |
13 | hun | Hungarian | 12656 | szaby78 | 3574 | 28.2% |
14 | pes | Persian | 10280 | pliiganto | 3816 | 37.1% |
15 | heb | Hebrew | 10118 | Eldad | 7448 | 73.6% |
16 | por | Portuguese | 9628 | brauliobezerra | 5994 | 62.3% |
17 | ara | Arabic | 7940 | saeb | 5886 | 74.1% |
18 | isl | Icelandic | 7721 | Swift | 7472 | 96.8% |
19 | tur | Turkish | 6434 | boracasli | 3821 | 59.4% |
20 | nds | Low Saxon | 5753 | slomox | 5490 | 95.4% |
21 | dan | Danish | 5032 | danepo | 4628 | 92.0% |
22 | bul | Bulgarian | 4602 | ednorog | 4027 | 87.5% |
23 | uig | Uyghur | 3747 | FeuDRenais | 3263 | 87.1% |
24 | hin | Hindi | 3468 | minshirui | 3463 | 99.9% |
25 | wuu | Shanghainese | 3257 | fucongcong | 1612 | 49.5% |
26 | vie | Vietnamese | 2987 | autuno | 1876 | 62.8% |
27 | bel | Belarusian | 2158 | Demetrius | 2089 | 96.8% |
28 | tlh | Klingon | 2104 | Vortarulo | 2097 | 99.7% |
29 | jbo | Lojban | 2017 | Zifre | 1179 | 58.5% |
30 | yue | Cantonese | 1930 | nickyeow | 1767 | 91.6% |
31 | nob | Norwegian (Bokmål) | 1872 | contour | 1090 | 58.2% |
32 | fin | Finnish | 1585 | Hautis | 502 | 31.7% |
33 | ina | Interlingua | 1537 | McDutchie | 1494 | 97.2% |
34 | swe | Swedish | 1190 | Don | 719 | 60.4% |
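For anyone who wants to reproduce a table like this one, here is a minimal sketch of the computation. It assumes the input is a list of (language code, owner) pairs, one per sentence; that input format and the function name `leader_stats` are made up for illustration and are not how the stats were actually generated.

```python
from collections import Counter, defaultdict

def leader_stats(sentences, min_total=1000):
    """Build ranking rows from (language_code, owner) pairs, one pair per sentence.

    The input format is a hypothetical one chosen for this sketch.
    Returns tuples of (rank, code, total, leader, owns, pct_owned).
    """
    # Count how many sentences each member owns in each language.
    per_lang = defaultdict(Counter)
    for code, owner in sentences:
        per_lang[code][owner] += 1

    rows = []
    for code, owners in per_lang.items():
        total = sum(owners.values())
        # Only keep languages with more than min_total sentences.
        if total <= min_total:
            continue
        leader, owns = owners.most_common(1)[0]
        rows.append((code, total, leader, owns, 100 * owns / total))

    # Rank languages by total number of sentences, largest first.
    rows.sort(key=lambda row: row[1], reverse=True)
    return [(rank, *row) for rank, row in enumerate(rows, start=1)]
```

Ties for the top owner are resolved by whichever member was seen first, which is an arbitrary choice in this sketch.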
Progress
Let's see how the corpus has progressed since last time...
- We've reached our 800,000+ milestone; we're now at 834,000+ sentences.
- The top 5 is still the same.
- Persian and Hebrew joined the 10,000+ family!
- Low Saxon, Bulgarian, Klingon, Finnish and Interlingua joined the 1,000+ family!
At this rate, we should reach our 1 million milestone some time around September :)
I just thought of a statistic related to this information about the leaders. Take Icelandic, for example. That Swift character owns 96.8% of all the sentences! In a way, that can be seen as a sign of how homogeneous the contributions in that language are. A single contributor is more likely to provide a narrow cross-section of the language, so a high share may be an indicator of a less representative set.
It could thus be interesting to see how the languages rank with regard to how evenly the sentences in each language are distributed among contributors. The Gini coefficient comes to mind as a candidate for this.
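As a rough sketch of that idea, the Gini coefficient can be computed from the number of sentences each contributor owns in a language. The function name `gini` and the input format (a plain list of per-contributor counts) are assumptions for illustration; the table above only gives the leader's count, so the full per-contributor breakdown would have to come from the corpus itself.

```python
def gini(counts):
    """Gini coefficient of non-negative per-contributor sentence counts.

    0.0 means every contributor owns the same number of sentences;
    values close to 1.0 mean a few contributors own almost everything.
    """
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Standard formula on ascending data with 1-based ranks i:
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum(i * x for i, x in enumerate(values, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n
```

One caveat with this measure: a language with a single contributor gets a coefficient of 0, because that one person's share is trivially "even". It would probably need to be read alongside the number of contributors, or alongside %owned.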