Monday, May 26, 2014

Google Summer of Code projects for Tatoeba

As I mentioned in a previous blog post, we have students coding for us this summer as part of the Google Summer of Code program. Coding officially started this week, on Monday the 19th. For those who are curious, here's more about the projects and the students.

Jake's project - Export to Anki deck

I intend to write a web application that will accept an Anki deck from a user and compare it against the sentence database to find sentences that contain exactly one word the user does not know yet. The idea is based on the Input Hypothesis. The user will be able to specify tags or certain words they want in the sentences. The application will then compile these sentences into an .apkg file (Anki's file format for importing decks) and send it to the user.
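To make the "exactly one new word" idea concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not the project's actual code: the function names, the regex-based tokenizer, and the idea of representing the deck as a plain set of known words are all hypothetical.

```python
import re


def tokenize(sentence):
    """Lowercase a sentence and split it into word tokens (naive tokenizer)."""
    return re.findall(r"\w+", sentence.lower())


def one_new_word_sentences(known_words, sentences):
    """Yield (sentence, new_word) for sentences with exactly one unknown word.

    known_words stands in for the vocabulary extracted from the user's
    Anki deck; how that extraction works is not covered here.
    """
    known = {w.lower() for w in known_words}
    for sentence in sentences:
        unknown = {w for w in tokenize(sentence) if w not in known}
        if len(unknown) == 1:
            yield sentence, unknown.pop()


# Hypothetical data: words the user already knows, and a tiny corpus.
known = {"i", "like", "the", "cat"}
corpus = [
    "I like the cat.",         # no new word, skipped
    "I like the dog.",         # exactly one new word: "dog"
    "The dog chased a ball.",  # several new words, skipped
]
matches = list(one_new_word_sentences(known, corpus))
```

A real implementation would need language-aware tokenization (for example for Japanese or Chinese, where words are not separated by spaces) and some handling of inflected forms.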
More details:

Pallav's project - Administrative scripts

As we know, the current site has gone through a couple of crashes and periods of instability, and handling such issues requires multiple administrators. To ease the administrators' work and to make recovery quick and easy, there is a need for administrative scripts that automate most of the tasks. Since the current repository lacks proper documentation for the initial setup, it is very difficult for a new user to set up a development environment. So the project's main aim is to create scripts that simplify the setup, along with a few supporting scripts that can easily perform backup, restore, export, import, etc. Also, in order to make recovery from a crash easy and smooth, an Ansible-based solution is to be created that could help the production server recover from a crash back to its latest stable version.
Along with these administrative scripts, the project extends to implementing or improving some export scripts. The existing scripts do not allow on-the-fly export of dumps, and the dumps also lack some useful information. So the aim is to overcome these shortcomings as well.
More details:

Saeb's project - A Complete Python Rewrite of Tatoeba

The current codebase is hard to maintain and work with, not to mention that it hasn't seen any major improvements in the past year or two. A rewrite of the codebase to use a higher-level language and framework will make it more maintainable, cut down on development time, and attract more developers. Also, a move towards a graph database, or the use of graph algorithms on top of a relational database, will greatly reduce server load and improve page response time. Finally, an API will greatly reduce the complexity of interacting with the website and will allow an ecosystem of external tools and programs to grow around Tatoeba. This project aims to achieve all of the above and demonstrate the power of this design through a full-featured cross-platform JavaScript interface that should run on major desktop platforms, a number of mobile platforms, and in major browsers.
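As an illustration of what "graph algorithms on top of a relational database" could mean, here is a small Python sketch. It assumes that translation links are stored as pairs of sentence IDs, as rows of a link table would be, and uses a breadth-first search to collect translations up to a given distance. The function name, the data layout, and the depth limit are assumptions made for this example, not part of the proposal.

```python
from collections import deque


def translations_within(links, start, max_depth=2):
    """Return {sentence_id: distance} for sentences linked to start.

    links is an iterable of (id_a, id_b) pairs, treated as undirected
    edges. Distance 1 is a direct translation, distance 2 a translation
    of a translation.
    """
    graph = {}
    for a, b in links:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

    distances = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if distances[current] == max_depth:
            continue  # do not expand beyond the depth limit
        for neighbour in graph.get(current, ()):
            if neighbour not in distances:
                distances[neighbour] = distances[current] + 1
                queue.append(neighbour)

    del distances[start]  # the starting sentence is not its own translation
    return distances


# Hypothetical link rows: 1-2, 2-3 and 3-4 are translation pairs.
links = [(1, 2), (2, 3), (3, 4)]
result = translations_within(links, 1)
```

With these sample links, sentence 2 is found at distance 1 and sentence 3 at distance 2, while sentence 4 lies beyond the depth limit and is left out.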
More details:
Wall thread:

Harsh's project - A Mass Import System For Open Texts

A lot of open texts are freely available on the internet. For example, each language has its own folklore, which is usually open. Sometimes the same open text is also available in several languages, for example a Shakespeare play. I am developing a system that will take these open texts as input and churn out sentences that can be useful for the corpus. If the same text is available in a different language, the system tries to pair up sentences that mean the same thing in both languages, so that language pairs can be generated.
A responsive website will also be created that will show these sentences and let users validate the translations or fix minor mistakes where necessary. This will save a lot of time and can generate many sentence pairs, which will act as a steroid for the corpus.
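The splitting and pairing step described above can be sketched roughly as follows. This is a deliberately naive Python illustration, not the project's method: it splits on end-of-sentence punctuation and pairs sentences by position, and only when both texts have the same number of sentences. The function names and sample texts are made up for the example.

```python
import re


def split_sentences(text):
    """Split text after '.', '!' or '?' followed by whitespace (naive)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def pair_by_position(source_text, target_text):
    """Pair the n-th source sentence with the n-th target sentence.

    Returns an empty list when the sentence counts differ, because
    positional pairing is not trustworthy in that case.
    """
    src = split_sentences(source_text)
    tgt = split_sentences(target_text)
    if len(src) != len(tgt):
        return []
    return list(zip(src, tgt))


# Hypothetical sample texts in two languages.
english = "To be, or not to be. That is the question."
french = "Être ou ne pas être. Telle est la question."
pairs = pair_by_position(english, french)
```

Real translations rarely match sentence for sentence, so a practical system would need a proper alignment method, and the candidate pairs would still go through the human validation step mentioned above.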
More details:
