Monday, May 26, 2014

Google Summer of Code projects for Tatoeba

I mentioned in a previous blog post that we have students coding for us this summer as part of the Google Summer of Code program. Coding officially started on Monday, May 19th. For people who are curious, here's more about the projects and the students.

Jake's project - Export to Anki deck

I intend to write a web application that will accept an Anki deck from a user and compare it against the sentence database to find sentences that contain exactly one word that is new to the user. The idea is based on the Input Hypothesis. The user will be able to specify tags or certain words they want in the sentences. The application will then compile these sentences into an .apkg file (Anki's file format for importing decks) and send it to the user.
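To give a rough idea of the "exactly one new word" filter, here is a minimal sketch in Python. It is not Jake's actual code: the function names, the naive regex tokenizer, and the idea of passing the known words as a plain set (rather than reading them from a real Anki deck) are all assumptions made for illustration.

```python
import re

def tokenize(sentence):
    """Naive tokenizer (an assumption): lowercase and keep runs of word characters."""
    return re.findall(r"\w+", sentence.lower())

def sentences_with_one_new_word(sentences, known_words):
    """Return (sentence, new_word) pairs where exactly one distinct word is unknown."""
    known = {word.lower() for word in known_words}
    matches = []
    for sentence in sentences:
        unknown = {word for word in tokenize(sentence) if word not in known}
        if len(unknown) == 1:
            matches.append((sentence, next(iter(unknown))))
    return matches

# Hypothetical data: words the user already has in their deck.
known = {"i", "like", "the", "cat"}
candidates = ["I like the dog.", "The cat sleeps quietly.", "I like the cat."]
print(sentences_with_one_new_word(candidates, known))
```

A real implementation would need language-aware tokenization (this one would not work for languages written without spaces) and would have to extract the known words from the deck's cards.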
More details:

Pallav's project - Administrative scripts

As we know, the current site has gone through a couple of crashes and periods of instability, and handling such issues requires multiple administrators. To ease the task of administrators and to make recovery quick and easy, there is a need for some administrative scripts that automate most of the tasks. Since the current repository lacks proper documentation for the initial setup, it is very difficult for a new user to set up a development environment. So the project's main aim is to create scripts that simplify the setup task, along with a few supporting scripts that can easily perform backup, restore, export, import, etc. Also, in order to make recovery from a crash easy and smooth, an Ansible-based solution is to be created that can help the production server recover from a crash back to its latest stable version.
Along with these administrative scripts, the project extends to implementing or improving some export scripts. The existing scripts do not allow on-the-fly export of dumps, and the dumps also lack some useful information. So the aim is to overcome these shortcomings as well.
More details:

Saeb's project - A Complete Python Rewrite of Tatoeba

The current codebase is hard to maintain and work with, not to mention that it hasn't seen any major improvements in the past year or two. A rewrite of the codebase to use a higher-level language and framework will make it more maintainable, cut down on development time, and attract more developers. Also, a move towards a graph database or the use of graph algorithms on top of a relational database will greatly reduce server load and improve page response time. Finally, an API will greatly reduce the complexity of interacting with the website and will allow the growth of an ecosystem of external tools and programs around Tatoeba. This project aims to achieve all of the above and demonstrate the power of this design through a full-featured cross-platform JavaScript interface that should run on major desktop platforms, a number of mobile platforms, and in major browsers.
More details:
Wall thread:

Harsh's project - A Mass Import System For Open Texts

A lot of open texts are freely available on the internet. For example, each language has its own folklore, which is usually open. Sometimes the same open text is also available in several languages, for example a Shakespeare play. I am developing a system that will take these open texts as input and churn out sentences that can be useful for the corpus. If the same text is available in a different language, the system tries to pair up sentences that mean the same thing in the two languages, so that language pairs can be generated.
A responsive website will also be created which will show these sentences and wait for some user to validate the translations or edit minor mistakes if necessary. This will save a lot of time, and can generate many sentence pairs which will act as a steroid for the corpus.
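As an illustration only, the pairing step could be sketched in Python like this. This is not Harsh's implementation: it assumes a naive punctuation-based sentence splitter and pairs sentences purely by position, giving up when the two texts do not split into the same number of sentences. A real aligner would have to be much smarter, and every proposed pair would still go to a human for validation.

```python
import re

def split_sentences(text):
    """Naive splitter (an assumption): break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

def pair_sentences(source_text, target_text):
    """Pair sentences by position; return no pairs if the sentence counts differ."""
    source = split_sentences(source_text)
    target = split_sentences(target_text)
    if len(source) != len(target):
        return []  # positions cannot be trusted, so propose nothing
    return list(zip(source, target))

english = "To be, or not to be. That is the question!"
french = "Être ou ne pas être. Telle est la question."
for pair in pair_sentences(english, french):
    print(pair)  # each pair would then wait for a user to validate or edit it
```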
More details:

Saturday, May 17, 2014

Tatoeba update (May 17, 2014): unapproved sentences and other changes

With this update, we are introducing functionality to help us manage sentences contributed by users who ignore repeated warnings that their sentences violate copyright laws or that their sentences are faulty. This is intended to be a means of dealing with offenders rather than a comprehensive means of ranking users or their sentences. We need this measure to protect those who use and redistribute our sentences. (See .)

Admins (and eventually corpus maintainers) will be able to mark sentences as untrustworthy, after which they will be displayed in red and excluded from downloads. Admins will also be able to mark users as untrustworthy, which will cause new sentences contributed by those users to be marked as untrustworthy as well. In this first iteration, there is no means of indicating via the user interface that a particular user is untrustworthy in a particular language but can be trusted in other languages.

In general, if you see a copyrighted or faulty sentence, you should leave a comment on it. Corpus maintainers can delete these sentences after a warning period. However, if you see that a user is posting substantial numbers of such sentences and does not respond to either comments or private messages, inform an admin.

We are introducing other functionality as well:

(1) We now allow lists of up to 100 sentences to be downloaded. When an attempt is made to download a longer list, a message is displayed that states this maximum, as well as the length of the given list. If you want your lists to be downloadable, we encourage you to keep them to 100 sentences or fewer.

(2) We now show up to seven numbered page boxes at the top of a list, rather than the five that we previously displayed. This makes better use of the space we already have available.

(3) When a login attempt fails, we show a message rather than simply redisplaying the page.

Finally, we have added Gujarati to our list of languages.

Saturday, May 10, 2014

New feature: unapproved sentences

We are soon going to release a new feature, and I would like to take some time to talk about it. First of all, here's what this feature will do:
  • Corpus maintainers will be able to mark a sentence as "unapproved".
  • Admins will be able to change the "level" of a contributor. By default contributors have a level of 0, but admins can set this level to -1 so that any new sentence/translation from these contributors is marked as "unapproved".
  • Unapproved sentences will still be in the database and will still be indexed whenever we run the indexation, but will be displayed in red on the website.
  • Unapproved sentences will however NOT be exported into the CSV file that we distribute.
The goal of this feature is to deal with 2 issues:
  1. Bad quality sentences. We want Tatoeba to become more useful for language learners. The problem is that since everyone can contribute sentences and translations, some contributions are not reliable enough for language learning, but maybe not bad enough that it's clear they should be deleted.
  2. Non CC-BY sentences. It often happens that new contributors copy-paste sentences from other language learning sources. This is a problem because Tatoeba redistributes the sentences under the CC-BY license and the content needs to be CC-BY compliant.
Setting those sentences as "unapproved" allows us to warn users that there is an issue with the sentence and that they should use it with extra care. This feature will also allow admins to act more quickly when a contributor is polluting the corpus. Admins can lower the level of a contributor so that all their subsequent contributions will be marked in red. The contributor will see for themselves that their contributions are red as soon as they are saved.
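For those who like to see things in code, the behavior described above can be sketched in Python as follows. This is not the code running on Tatoeba: the class names, field names, and the tab-separated export layout are assumptions made for the example. Only the levels (0 by default, -1 for flagged contributors) and the exclusion of unapproved sentences from the export come from the description above.

```python
import csv
import io
from dataclasses import dataclass

@dataclass
class Contributor:
    username: str
    level: int = 0  # default level; admins can lower it to -1

@dataclass
class Sentence:
    sentence_id: int
    lang: str
    text: str
    unapproved: bool = False  # unapproved sentences are shown in red

def add_sentence(contributor, sentence_id, lang, text):
    """New sentences from contributors with a negative level start out unapproved."""
    return Sentence(sentence_id, lang, text, unapproved=contributor.level < 0)

def export_csv(sentences):
    """Write sentences to tab-separated text (assumed layout), skipping unapproved ones."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for sentence in sentences:
        if sentence.unapproved:
            continue  # still stored and indexed, but not distributed
        writer.writerow([sentence.sentence_id, sentence.lang, sentence.text])
    return buffer.getvalue()

alice = Contributor("alice")          # hypothetical trusted contributor
bob = Contributor("bob", level=-1)    # hypothetical contributor lowered by an admin
corpus = [add_sentence(alice, 1, "eng", "Hello."),
          add_sentence(bob, 2, "eng", "Copied text.")]
print(export_csv(corpus))
```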

This feature can obviously be tuned a lot more. Ideally we should treat bad quality sentences differently from non CC-BY sentences. Ideally we should set a separate level for each user in each language, rather than a single level per user. Ideally we should also have approved sentences, and we could have different levels of approved and unapproved sentences. We just don't have the time and resources to implement these things right now, but they are part of the next steps.

Thursday, May 8, 2014

Tatoeba & Google Summer of Code

Many of you probably don't know, but Tatoeba has been accepted as a mentoring organization for this year's Google Summer of Code. It means that we will have students coding for Tatoeba during the summer, and they will be paid by Google to do so. More specifically, we will have 4 students working on projects related to Tatoeba, from May 19th to August 18th.

There was a first meeting between the students and their mentors 10 days ago so that everybody could get acquainted. I am now organizing another meeting so that the students can get to know the community better. AlanF and I will be there, and anyone else who is interested in joining is welcome :)

Date and time of the meeting: Sunday, May 11th, 14:00 UTC.

To participate, join our IRC Channel:
  • server: freenode
  • channel: tatoeba