Saturday, March 6, 2010

Tatoeba update (Mar 6th, 2010)

Ah, finally an update that will be integrating "real" changes. Here's a short description of the new stuff.


Possibility to indicate the language

Until now, when you wanted to add a sentence or a translation, you had no way to indicate which language you were contributing in. The language was auto-detected, but this was still a bit puzzling the first time you tried to add something. Most people were probably thinking "But how will they know what language... oh okay, it's auto-detected". More importantly, we could not really consider supporting languages that are not supported by Google's language detection tool (which we are using): users would have had to correct the language manually every time, and that would have been annoying.

This is a small but important feature we wanted to have in Tatoeba for a long time and it's finally here.


Adopting in place

There's one important concept in Tatoeba: you can only modify a sentence if you are the "parent" (owner) of that sentence. You are by default the parent of the sentences you add, which implies only YOU can modify your sentences (which makes sense). But you can also become the parent of a sentence by "adopting" it, because many, many sentences in Tatoeba do not have any parent. The reason you'd want to adopt a sentence is that you noticed a mistake and want to correct it, which you can't do without being the parent of that sentence.

The adopt feature was quite "heavy" to use. Every time, you were redirected to the "info" page of that sentence. Now it can all be done in one place. Click on the adopt icon, and there you go: no redirection, you can modify it right away. This should make the task of correcting sentences less annoying.
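
To make the ownership rule concrete, here is a minimal sketch in Python. It is only a model of the rule described above, not Tatoeba's actual code: the `Sentence` class, its method names and the user names are hypothetical, and it assumes that only a sentence without a parent can be adopted.

```python
class Sentence:
    """Hypothetical model of a sentence and its "parent" (owner)."""

    def __init__(self, text, owner=None):
        self.text = text
        self.owner = owner  # None means the sentence has no parent

    def adopt(self, user):
        # Assumption: only sentences without a parent can be adopted.
        if self.owner is not None:
            raise PermissionError("this sentence already has a parent")
        self.owner = user

    def edit(self, user, new_text):
        # Only the parent may modify the sentence.
        if self.owner is None or self.owner != user:
            raise PermissionError("only the parent can modify this sentence")
        self.text = new_text


orphan = Sentence("Je m'appel Trang.")
orphan.adopt("alice")                        # adopt first...
orphan.edit("alice", "Je m'appelle Trang.")  # ...then correct the mistake
print(orphan.text)
```

Trying to call `edit` as any user other than the parent raises `PermissionError`, which mirrors why you have to adopt a sentence before you can correct it.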


Only main sentence displayed when translating

This should solve a problem that we've had for a long time. Users who are not familiar with the system tend to add translations without checking which sentence they are actually adding their translation to. Many times, people were adding a translation to a Japanese sentence when they were in fact translating from the English sentence. And because of the way things were displayed, they thought "okay, I'm just adding one sentence in that box". But it's not the way things work in Tatoeba... Now that only the main sentence is displayed when you translate, hopefully this will be clearer.


Possibility to delete comments

Yes, now you can delete your comments (comments on sentences as well as comments on the wall). You cannot edit them yet though. That will be for next time (probably). Be careful though! Deleting a comment will delete it forever.
We haven't made a page that lists all your comments yet, but you can go through all the comments in Tatoeba here.


What next

Well I'm looking at our todo list, and it's hard to say... I'd rather let it be a surprise ;)

Tuesday, February 23, 2010

How to be a good contributor in Tatoeba

Introduction

This article is a must-read for anyone who is serious about contributing to Tatoeba. It is quite long, so here is a summary of how to be a good contributor:
  1. Understand the context of the project
  2. Understand how the corpus is structured
  3. Do not pay attention to the other translations
  4. Do not translate word for word
  5. Do not change the meaning of a sentence
  6. Do not change the language of a sentence
  7. Make sure you are adding comments to the right sentence
  8. Do not add sentences from copyrighted content
  9. Do not annotate sentences
  10. Give us feedback
  11. Do not wait for us to code it if you can code it
  12. Indicate your languages in your profile
  13. Encourage and educate new (or even not so new) contributors
  14. Spread the love


    1. Understand the context of the project

    I will (someday) write a more detailed (his)story, but here are the basic facts you should be aware of.
    • I started this project in 2006. The initiative was driven by a passion for language learning and the frustration of not finding an adequate online dictionary.
    • The project is focused on sentences, and I insist on sentences. The reason is that I felt example sentences were (and still are) a very scarce resource. Please only add complete sentences if you are going to contribute.
    • I was actually "alone" on this project for some time. It was only three years later, in 2009, that other people (all computer science students) started to help me out coding more features.
    • Tatoeba is NOT a commercial project. We're not a company, and we're not paid for doing any of this. It is something that we're working on in our free time.
    • To be honest, we don't exclude the possibility of starting a company someday, but only if we have an innovative, coherent and ethical business model (yea, good luck). Things like having ads everywhere and driving a lot of traffic, or forcing people to pay to access the data, are out of the question.


    2. Understand how the corpus is structured

    This is the tricky part, and hopefully I can explain it clearly enough for everyone.

    The corpus is not structured as a table but as a graph. What does it mean? Well, imagine you had to extract part of the corpus and write it on paper. What you would certainly do is something like this:

    English              French                 Spanish
    My name is Trang.    Je m'appelle Trang.    Me llamo Trang.
    How are you?         Comment vas-tu?        ¿Cómo estás?
    ...                  ...                    ...

    That's a table structure. There are rows and columns: each row contains sentences with the same meaning, and each column contains sentences in the same language. That's the first approach anyone would have, but that's NOT how the corpus is structured.

    This is how the corpus is structured:



    That's a graph structure. There are nodes and edges: each node represents a sentence, and each edge represents the link between two sentences. When two sentences are linked, they have the same meaning.

    How you contribute differs greatly from one structure to the other. One important implication is that you can add multiple translations in the same language for a specific sentence. You think there are two ways to translate a sentence and you really can't decide which would be the best? Well, just add both!
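
For programmers, the graph structure can be sketched in a few lines of Python. This is only an illustration, not how Tatoeba stores its data: the `Corpus` class, its method names and the three-letter language codes are hypothetical, and links are assumed to be undirected.

```python
from collections import defaultdict


class Corpus:
    """Hypothetical in-memory model: sentences are nodes, links are edges."""

    def __init__(self):
        self.sentences = {}            # sentence id -> (language, text)
        self.links = defaultdict(set)  # sentence id -> ids of linked sentences
        self._next_id = 1

    def add_sentence(self, lang, text):
        sid = self._next_id
        self._next_id += 1
        self.sentences[sid] = (lang, text)
        return sid

    def link(self, a, b):
        # Assumption: a link works both ways (same meaning in both directions).
        self.links[a].add(b)
        self.links[b].add(a)

    def add_translation(self, source_id, lang, text):
        # Translating = adding a new node plus one edge to the source sentence.
        new_id = self.add_sentence(lang, text)
        self.link(source_id, new_id)
        return new_id

    def translations(self, sid, lang=None):
        # Texts of the sentences directly linked to sid, optionally by language.
        return sorted(
            self.sentences[t][1]
            for t in self.links[sid]
            if lang is None or self.sentences[t][0] == lang
        )


corpus = Corpus()
en = corpus.add_sentence("eng", "How are you?")
corpus.add_translation(en, "fra", "Comment vas-tu?")
corpus.add_translation(en, "spa", "¿Cómo estás?")
corpus.add_translation(en, "spa", "¿Cómo está usted?")
print(corpus.translations(en, "spa"))  # both Spanish translations
```

Nothing in this model limits a sentence to one translation per language, which is exactly what a table with one column per language could not express.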

    Some other implications are pointed out below.


    3. Do not pay attention to the other translations

    When you translate a sentence, you are in fact adding a sentence (a node) and adding a link (an edge) between the "original" sentence and your translation. So the only thing you need to care about is that you are adding a proper translation to the "main sentence" (the one at the top, written in a larger font).

    More concretely, if you were in this situation and wanted to add a Spanish translation to the English sentence:

    How are you?
    => Comment vas-tu?

    You could add "¿Cómo estás?" (casual) just as well as "¿Cómo está usted?" (formal). Or you could add both (because you can add multiple translations in the same language).
    If you understand French, it doesn't matter if the French sentence is the casual form, you only have to worry about the fact that your translation is a proper translation of the English sentence. A proper translation means that if someone had to translate your contribution back to English, "How are you?" would be a possibility.


    4. Do not translate word for word

    We are not interested in having sentences that sound like they were written by a robot. We want sentences that really are what a native speaker would say. Translating is a very difficult task, we know it. But if you are translating into your native language, you should always, always re-read your translation as if it were a standalone sentence, and ask yourself if it is actually something people would say. You can use the comments to indicate a literal translation.

    However, if you are not translating into your native language (which you can), you are forgiven for not writing native-like sentences. It's a collaborative project after all, and someday (hopefully) a native speaker will come across your contribution and see if it sounds right to them or not.

    The point is to understand that Tatoeba is not only about providing translations, it's also about gathering data about a language. Tatoeba could simply be limited to adding sentences without translating them at all. If we were to extract only the sentences in Italian, we would like each of them to be representative of the Italian language.

    The sentences are the basic layer. The links between the sentences are another layer. But the corpus should make sense without those links.


    5. Do not change the meaning of a sentence

    If you were viewing the corpus with a table structure, you would be tempted to change a sentence so that its meaning fits with all the other sentences. But that's obviously not a good idea.

    For instance:

    My name is Trang.
    => Je m'appelle Trang.
    => Vamos a la playa.

    You notice that the Spanish sentence (which says "Let's go to the beach") has nothing to do with the English sentence.

    Perhaps you don't speak Spanish very well so you're not confident in modifying the Spanish sentence and decide to change the English sentence. Problem: what about the French sentence? It won't fit the English sentence anymore...

    Perhaps you are a native Spanish speaker and decide to change the Spanish sentence. In this particular case, it would still be acceptable because the Spanish sentence is not linked to any other sentence. But if someone had translated that Spanish sentence into Italian, "correcting" the Spanish sentence would cause a conflict with the Italian translation.

    Then there is a problem you may not have thought of: when changing the meaning of a sentence, you are potentially erasing unique vocabulary. What if the Spanish sentence was currently the only one with "playa" in it?

    So the best way to proceed in this kind of situation is to add a new Spanish translation (Me llamo Trang) and "unlink" the current Spanish translation.

    Note that at the moment, there is still no way to link or unlink sentences. So if you come across a sentence that needs to be unlinked and it frustrates you not to be able to, you have to bear with it. The "link/unlink" feature is a top priority in our todo list though.
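
Since the link/unlink feature doesn't exist yet, here is what the recommended fix would look like on a toy model in Python. Everything here is hypothetical (the ids, the `add_translation` and `unlink` helpers, the way links are stored) and only illustrates the idea of adding a new node and removing an edge instead of rewriting a node.

```python
# Hypothetical model: sentences keyed by id, links stored as unordered pairs.
sentences = {
    1: ("eng", "My name is Trang."),
    2: ("fra", "Je m'appelle Trang."),
    3: ("spa", "Vamos a la playa."),
}
links = {frozenset({1, 2}), frozenset({1, 3})}


def add_translation(source_id, lang, text):
    """Add a new sentence and link it to the source sentence."""
    new_id = max(sentences) + 1
    sentences[new_id] = (lang, text)
    links.add(frozenset({source_id, new_id}))
    return new_id


def unlink(a, b):
    """Remove the link between two sentences; both sentences are kept."""
    links.discard(frozenset({a, b}))


def translations(sid):
    """Return the texts of all sentences directly linked to sid."""
    found = []
    for pair in links:
        if sid in pair:
            (other,) = pair - {sid}
            found.append(sentences[other][1])
    return sorted(found)


# The fix: add the correct Spanish translation, then unlink the wrong one.
add_translation(1, "spa", "Me llamo Trang.")
unlink(1, 3)

print(translations(1))
print(sentences[3])  # the "playa" sentence is still in the corpus
```

The English sentence ends up with correct translations only, the French link is untouched, and the sentence containing "playa" survives as a node of its own.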


    6. Do not change the language of a sentence

    If the language flag of a sentence is wrong (for instance it was flagged as Chinese when it is in fact Japanese), then of course, you can change it. That's not what I mean by "Do not change the language".
    What I mean is that you shouldn't replace a Japanese sentence with a Chinese sentence that has the same meaning (and that applies to any language of course). It shouldn't happen often, but if you're in a situation where you want to do that, then don't.

    The problem is that a sentence can be associated with data that is dependent on its language. For instance comments. People can post comments on sentences, and the comments may be valid only because the sentence was in a certain language.

    At the moment it is more of an issue for Japanese sentences, which are associated with some sort of annotations. These annotations are not displayed because they are not useful for normal users. If you change a Japanese sentence into an English sentence, then the annotations that were associated with it won't make sense anymore.


    7. Make sure you are adding comments to the right sentence

    When you post a comment, the comment is only associated with the main sentence, so make sure that your comment is related to that particular sentence. Typically, if you want to point out a spelling mistake, like here:

    My name is Trang.
    => Je m'appel Trang.
    => Me llamo Trang.

    You can see that the French sentence is wrong. It should be "appelle" and not "appel". If you post your comment here, it will be associated with the English sentence (because it's at the top, so it's the main sentence). This is not what you want. The right thing to do is to click on the French sentence first. It will change the configuration into:

    Je m'appel Trang.
    => My name is Trang.
    => Me llamo Trang.   

    And then you can post your comment.

    Now there is the case where you want to point out that a translation is wrong. Your comment will be related to two sentences, so where should you post it? Well, ideally, for this type of situation, there should be the possibility to comment on a link between two sentences. But we don't have that, we can only comment on a sentence. So you are free to decide where you want to post your comment. Just remember that it's good as long as your comment is related to the main sentence.


    8. Do not add sentences from copyrighted content

    We are distributing the corpus under the Creative Commons Attribution (or CC-BY) license. It makes it possible for anyone to re-use this data in any way they want as long as they mention Tatoeba in their work.

    As a contributor, you have agreed with the terms of use (which of course you haven't read), and therefore you are providing your contributions under the CC-BY license as well. Which means we can re-use your data in any way we want as long as we mention you. So we are re-using your work in Tatoeba, and we mention you through the logs and the stats.

    But providing your work under CC-BY means you also have some responsibility for what you provide. And you have to know that you cannot legally redistribute data if it was copied from a source that doesn't clearly state that you can do it. Typically, you cannot (legally) copy all the sentences from a textbook and add them to Tatoeba.

    Don't worry, you (and we) won't go to jail and be in debt for life if you've added a couple of sentences from a textbook (hopefully...). But the law forbids us to take someone's work and re-use it without their consent. Producing sentences and translations is work, so be careful where you get the sentences from. Preferably, come up with your own sentences or take them from books that are in the public domain.

    If you have added or have seen sentences that were copied from copyrighted material, change a few words so that it won't be exactly the same sentence. Or, go negotiate with the authors and convince them to release their work under the CC-BY license so we can re-use it.

    I'm not going to argue about whether all of this makes sense or not (obviously I don't believe it does), but it will help us a lot if everyone does what is necessary so we don't get sued.


    9. Do not annotate sentences

    Sentences should remain as "raw" as possible. Adding extra information in the sentence itself (like the reading of a word) can be a problem for people who are using our data in order to improve a natural language processing system for instance. You have to keep in mind that the sentences can be re-used for other purposes than how Tatoeba uses them, and therefore people who re-use them may not want to have these annotations.

    So for now, the only way to add extra information is to use the comments (as impractical as it may be). When we have a clearer idea of the types of information that people want to attach to the sentences (and when we have time), we will integrate the necessary features for it.


    10. Give us feedback

    We know that Tatoeba is not perfect so don't hesitate to tell us what you think is missing (just make sure no one has talked about it on the Wall already). Also tell us if you see any spelling mistake, feel that some explanations are not clear, or encounter bugs.

    We also know that Tatoeba is a cool project so feel free to tell us you like it too :P


    11. Do not wait for us to code it if you can code it

    As much as we welcome feedback, we welcome even more INITIATIVE. There are just sooo many things we could do. We can't take care of everything.

    For instance we are distributing the entire corpus, but many people probably don't need all the sentences in all the languages. You may just want the English-Spanish sentences. Well, instead of asking and waiting for us to provide a file with only English-Spanish sentences, you can code a tool (and please, tell us if you do) that will extract only what you want from our files.

    That's just one example but if you are a programmer, there could be many things you could do yourself instead of waiting for us to do it. But of course, tell us so we don't start working on something you plan to work on.
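
As a rough illustration of the extraction tool mentioned above, here is a minimal sketch in Python. This post doesn't describe the format of the downloadable files, so the layout is an assumption: one tab-separated file of sentences (id, language code, text) and one tab-separated file of links (two sentence ids per line). The function names and the sample data are made up for the example.

```python
import csv
import io


def load_sentences(fileobj):
    """Read 'id<TAB>lang<TAB>text' rows into {id: (lang, text)} (assumed format)."""
    reader = csv.reader(fileobj, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {int(row[0]): (row[1], row[2]) for row in reader if len(row) >= 3}


def extract_pairs(sentences, links_file, src_lang, dst_lang):
    """Return (source text, translation text) pairs for one language pair."""
    reader = csv.reader(links_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    pairs, seen = [], set()
    for row in reader:
        if len(row) < 2:
            continue
        a, b = int(row[0]), int(row[1])
        if a not in sentences or b not in sentences:
            continue
        # A link may be listed in either direction, so normalise it.
        if sentences[a][0] == dst_lang and sentences[b][0] == src_lang:
            a, b = b, a
        if sentences[a][0] == src_lang and sentences[b][0] == dst_lang:
            if (a, b) not in seen:
                seen.add((a, b))
                pairs.append((sentences[a][1], sentences[b][1]))
    return pairs


# Made-up sample data standing in for the real files.
sentences_data = io.StringIO(
    "1\teng\tHow are you?\n"
    "2\tfra\tComment vas-tu?\n"
    "3\tspa\t¿Cómo estás?\n"
    "4\teng\tMy name is Trang.\n"
    "5\tspa\tMe llamo Trang.\n"
)
links_data = io.StringIO("1\t2\n1\t3\n3\t1\n5\t4\n")

english_spanish = extract_pairs(load_sentences(sentences_data), links_data, "eng", "spa")
for source, translation in english_spanish:
    print(source, "=>", translation)
```

With real files you would pass objects returned by `open(..., encoding="utf-8", newline="")` instead of the `io.StringIO` samples, after checking that the actual layout matches the assumption.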

    You should also know that we are actually open source (under the AGPL license), but we are not really "promoting" this aspect because:
    1. The code hasn't met my standards of elegance yet... Still too many parts that make me cringe when I look at them.
    2. We still don't have a sound methodology and organization in our way of working and I really don't have time to manage more people.
    However if you love the project and are really motivated to join the development team, then feel free to contact us =)


    12. Indicate your languages in your profile

    For people who didn't know, you can edit your profile by clicking on your username (at the top, in the menu bar).

    Since Tatoeba involves languages, it can be very useful for other users to know which languages you can speak and how well you can speak them. We don't have a specific "languages" field so you will have to write about it in your profile description (in the section "Something about you").

    And tell other users to indicate their languages as well (if they haven't already), especially if they have contributed.


    13. Encourage and educate new (or even not so new) contributors

    The community is very important in a project like Tatoeba: we just can't achieve our ambitions without a strong community. But how do you build a strong community? Well, one thing is NOT to make new users feel lost and isolated.

    Part of this depends on the system. It has to be designed in a way that not only enables but also encourages users to interact with each other. Tatoeba is not great at that, but you have the minimum (private messages, wall, comments).

    And the other part depends of course on the community itself. There must be an effort from the community to build a stronger community. So if someone is asking a question that you can answer, don't hesitate to help out. If you notice someone is doing something wrong, don't hesitate to tell them the right way to do it. If you notice someone or some people have been contributing significantly, don't hesitate to drop a line (in a private message or on the Wall) to say "congratulations" or "thank you" for their work.

    More generally speaking, if you have any idea on how to make Tatoeba a more socially pleasant place to be, then go ahead!


    14. Spread the love

    Last but not least: you love the project, we love the project, we all want this project to become the greatest language tool of all time, so bring more people into this adventure!

    In the end, anyone who knows how to read and how to write can participate. There's no need to be a polyglot. If you can "just" hunt for mistakes and correct them or point them out, it will be already extremely helpful. The more people, the more mistakes we can take down, the more data we can produce that people can rely on. And everyone can live happily ever after.

    Monday, February 8, 2010

    Tatoeba update (Feb 8th, 2010)

    There haven't really been any new features in Tatoeba for a while now, and this update is... not going to bring anything new either, so I won't talk much about it. It was mostly about cleaning our source code and optimizing a few things, so there is no visible change. But the next updates will bring "real" new stuff :)

    Sunday, December 13, 2009

    Tatoeba update (Dec 12th, 2009)

    This may be the last "big" update for 2009. There isn't really any new feature contrary to the release we did a few weeks ago, but there are some important changes.


    Creative Commons Attribution license

    Tatoeba is a project that collects sentences, and we are very nice so we redistribute those sentences for free to the rest of the world -- and that's over 300,000 sentences.

    These sentences did not come out of nowhere. They are based on the Tanaka Corpus, a corpus compiled by Professor Yasuhito Tanaka at Hyogo University in Japan. Since Professor Tanaka released his data into the public domain, I thought I would leave Tatoeba's data in the public domain as well, which is what I did until recently... But guess what, I'm not allowed to do that. I don't want to get into details, but as a French citizen, I cannot legally put my own work in the public domain (just like any other French or European contributor in Tatoeba). That's just how the law is (cf. Wikipedia for those who can read French).

    Now there is something called CC0 that could potentially be used to get closer to the public domain. But my team and I are not specialists on this topic, and we are not sure if it is (legally) safe for us to use this license. So until we know more, we will redistribute the data under CC-BY. If this is a real problem for you, please let us know. Not that I will be able to find a solution to your problem, but at least I will be aware of what type of problems the CC-BY license can involve.


    Tatoeba, new home for the Tanaka Corpus

    For those of you who are learning Japanese, you certainly have (at least) heard of Jim Breen's WWWJDIC. The Tanaka Corpus was initially edited and maintained over there, and most of it was done by Paul Blay. But around November or December last year (2008) he decided to pull out of this task.

    Paul Blay was also an active contributor in Tatoeba and we made sure that the content in Tatoeba related to the Tanaka Corpus was synchronised with WWWJDIC's version.

    Ever since Paul Blay left, there hasn't really been any work done on the corpus. Recently, I suggested to Jim Breen that Tatoeba be used as the new platform to maintain the corpus, and he agreed. So I can at least announce to those who are willing to improve the Tanaka Corpus: Tatoeba is now the place to go.


    ISO 639 alpha-3

    This isn't going to have much impact on you unless you are planning to use Tatoeba's data: we're updating the language codes to ISO 639 alpha-3.
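
If you do consume the data, you may need to convert older codes in your own tools. The sketch below is only an illustration in Python: this post doesn't say which codes were used before, so it assumes two-letter ISO 639-1 codes, and the table covers just a handful of languages. The `to_alpha3` helper is hypothetical.

```python
# Partial ISO 639-1 (two-letter) to ISO 639-3 (three-letter) table.
# Assumption: the old codes were ISO 639-1; extend the table as needed.
ISO_639_1_TO_3 = {
    "en": "eng",
    "fr": "fra",
    "es": "spa",
    "ja": "jpn",
    "it": "ita",
    "de": "deu",
}


def to_alpha3(code):
    """Return the three-letter code for a two- or three-letter language code."""
    code = code.strip().lower()
    if len(code) == 3:
        return code  # assumed to be alpha-3 already
    try:
        return ISO_639_1_TO_3[code]
    except KeyError:
        raise ValueError(f"no alpha-3 mapping known for {code!r}") from None


print(to_alpha3("ja"))
print(to_alpha3("fra"))
```

Unknown codes raise `ValueError` rather than being passed through, so a missing entry in the table is noticed immediately.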


    What next?

    One thing is for sure, we won't be introducing new big features until past mid-January. Most of the members of the team are still students, and will soon reach the (so-much-loved) end of their semester (me included). We won't have the time to test and debug properly before our final exams are over.

    As for what we are planning exactly, I'll write more about this in a future post.

    Saturday, November 28, 2009

    How can Tatoeba be useful for language learners?

    In response to byzantinist who asked on Twitter:
    How can Tatoeba be useful for language learners, other than learning by adding sentences or using the sentences in SRS?
    Honestly, at the current state of the project, there isn't much more than that. 
    There are no grammar explanations, there are no lessons, there aren't many people to ask for help, there's not much data in languages other than English and Japanese, and the data is not even always reliable... Generally speaking, you can use Tatoeba as a complement to your language learning, but by itself there is no way it will teach you a language. (Yea, I know, I'm not very good at marketing :D)

    But the project has not reached its full potential yet. My team and I have a lot of ambition, we just don't have a lot of free time...

    To give you a better idea of the context, when I started the project I was very frustrated with online dictionaries and I had this vision of a "dictionary" in which whatever you are searching for, it will always provide you with results. Most importantly, it will always provide you with example sentences (and their translations). That would be useful, right?
    And I wondered: why can someone make a collaborative encyclopedia (cf. Wikipedia), but no one is trying to make a collaborative "dictionary"? Because obviously you can't build a "dictionary" like this unless you have at least thousands of people who are supporting your vision and who are willing to contribute.
    To make a long story short, I spent a few years building the tool I envisioned (at least its core). Lately some other people got involved and are helping me make this project grow faster. And now, we're reaching a phase where we really need a community, and preferably a smart one... I mean, if we cannot gather a bunch of dedicated and knowledgeable people to participate, then the project is going to remain useless despite its huge potential.

    Basically, at the moment, the project is focused on building a community and gathering/organizing data, more than on integrating language learning features (although we would love to). But there is no point building a language learning application if you don't have any good data to use... And the sad truth is that it's kind of difficult to find good data. So the concept is: we gather a lot of data, try to organize it, ensure it is of good quality and make it freely accessible, downloadable and redistributable, so that anyone who has a great idea for a language learning application (or a language tool) can just focus on coding the application and rely on us to provide data of excellent quality. And again, all of this is only possible with a strong community...

    However, if you have any specific ideas of features that would make your language learning experience better, just let us know and we'll be glad to work on it. I mean, the project is constantly evolving. As far as we're concerned, we're not going to stop integrating new features anytime soon.

    Sunday, November 8, 2009

    New release!

    There have been a lot of things going on in Tatoeba within the last couple of months. I never took the time to blog about it, but since we are planning a new release for next week, perhaps it's time I communicate a bit more...

    First of all, note that I said "we are planning" and not "I am planning". Indeed, in September (and even a few months before that) the project took another dimension, for I was not alone anymore!
    Of course, I never really was alone, but I was - until then - pretty much the only one to implement the next new features, the next improvements, the next design, and also the only one to be officially responsible for promoting the project (which is something I had absolutely no time for, actually)...
    But now we are 5 (brilliant, dedicated, innovative and incredibly charismatic young geeks! *hum* yea). Needless to say, I'm quite happy with this new team :)

    So what are we planning for this release?
    • A new design (prettier than the current one - hopefully :P).
    • The "list" feature, that will enable users to create lists of sentences.
    • Users profiles, for those who would like to tell more about themselves.
    • Private messages, so that users can contact each other.
    • And some sort of "Wall" (c.f. Facebook), to develop the community aspect of the project.
    We will then start promoting the project more intensively, because this project NEEDS a community. There is SO MUCH that can be done to provide linguistic data and language tools that are free and of good quality. As far as we're concerned, we're digging the technical side, we're trying to set up the "ideal" system for this purpose. But obviously, without a strong community, nothing can be achieved. So we do hope that more and more people will join the cause.

    On a side note, we also participated in the first "innovation competition" organized by the Université de Technologie de Compiègne (where three of us, myself included, are currently studying). And we have been selected in the first round (second round in mid-January).

    Saturday, July 25, 2009

    Activities for language students

    A few months ago I wrote about the idea of involving language teachers in the project. Before I illustrate a few ideas of how things can work, I'd like to answer a question that you may ask yourself (if you are a language teacher): why should I participate?

    It's simple: Tatoeba can become a very useful resource... if only it had more data. The only way to gather the necessary data is to have a community working on it. You can be part of this community.
    The data can be used by people who simply need to translate something, but also by programmers who would like to code language learning applications, as well as by researchers working on problems related to language processing.
    I'd also like to mention here that Tatoeba is quite an ambitious project, and what's more, a non-profit one. We really need as much help as possible.


    Now, if you're a bit convinced, what can you do to help...


    Translating any sentences

    The most basic assignment would be to ask each student to translate a certain number of sentences every week. Since there are a lot more English and Japanese sentences, these sentences would usually be the "source" sentences. So if you're a French teacher in Japan, you would get your students to translate Japanese sentences into French. If you're an English teacher in France, you would get your students to translate English sentences into French. And so on. I'm focusing on French because it is my goal to translate all the English/Japanese sentences into French. But you could as well be an English teacher in Spain, and ask your students to translate English sentences into Spanish.

    The only work you would have to do as a teacher is to check that everyone has contributed the number of sentences you've required them to translate. The result is, even if you only have 20 students, and they each translated only 10 sentences each week for 10 weeks, that would still be 2000 additional sentences in the end.

    However this type of activity may not be very connected to the things you are teaching during your classes. It only has the advantage that it will help quickly increase the amount of data.


    Translating sentences with specific words

    Instead of translating just any sentence, you could ask your students to translate sentences with specific words in it. These words would of course be part of the vocabulary that you would like your students to learn.

    This activity would be more useful in an academic context because the data can be used for studying. Students can retrieve the sentences they have translated, or that their peers have translated, in order to review vocabulary before a test.


    Adding sentences with new vocabulary

    What if there are no sentences with the vocabulary you'd like your students to learn? You can always ask them to find a sentence somewhere on the Web, add it in Tatoeba, and translate it.


    Teaming up with another teacher

    The problem with asking students to translate sentences is that they are likely to make mistakes when they don't translate into their mother tongue. Ideally, when a student translates into the language they are learning, a native speaker should check that the translation actually means something.

    I think the best way to do this is to team up with another teacher. If you were an English teacher in France, you would team up with a French teacher in the U.K. for instance. Your students would be checking French translations added by their British partners, and the British students would be checking the English translations added by your students. In the process they can help each other by explaining why something is wrong and what would be better (provided the students have a decent enough level in the language they are learning to communicate with each other).


    So these are only a few ideas, and of course we could always come up with more sophisticated things.

    Anyway, if by any chance you are a language teacher and happen to be reading this, contact me!