Tatoeba Project

Sunday, July 20, 2014

Tatoeba update (July 20, 2014)

Small layout change
  • The logo and search bar have been redesigned a little bit. The logo does not have the "Tatoeba.org" text in it anymore, so there is now more space for the content.
  • The search bar now has an adaptable height. This probably has no visible impact for most people, but for those who use certain fonts, the search bar used to look a bit broken because those fonts are taller. That's now fixed.

Clicking on a log entry highlights the corresponding translation
  • On the sentence's page (for instance http://tatoeba.org/eng/sentences/show/1), you can now click on the logs to highlight the corresponding translation. This makes it easier to identify the text of the translation, since the logs only mention the translation's ID.
Optimization of "latest contributions" in a specific language
  • You may have experienced a "Gateway" error when trying to view the latest contributions in a specific language (for instance http://tatoeba.org/contributions/latest/fra). The query has been optimized to avoid this error.

That's it for this update.

On a side note, we are aware that Tatoeba has been pretty slow these days. If you are interested in helping us optimize the website, you are always welcome.

Sunday, June 29, 2014

Tatoeba update (June 29, 2014)

  • We now remember the settings you last chose for translating sentences (source language, target language, languages in which to show translations, whether to show only sentences with audio). Previously, we always chose the language in which the user interface was shown as the source language, and we used default settings for the other values.
  • The first two items in the "Not directly translated into" list have been renamed from "None" and "All languages" to "—" and "Any languages", respectively.
  • Language names are now displayed in a more logical order (collation). In languages whose writing system is alphabetic, the new order will group together language names beginning with the same letter, whether or not they are capitalized and whether or not they have a diacritical mark (accent). For other languages, the new order is also more logical.
  • Lists are now identified as "My lists" (editable only by the contributor who is logged in), "Collaborative lists" (editable by anyone), and "Other personal lists" (editable only by a particular contributor who is not the one logged in). This expresses their nature better than "private" and "public", which implied that only "public" lists could be read by others, which was never the case.
  • We fixed a bug where the date was displayed as "Nov 30th -0001, 00:00" when it should be displayed as "date unknown".
  • We now write the correct values for languages to the appropriate table when sentences and translations are imported from a file.
  • We cleaned up the codebase, getting rid of unused files and putting scripts in a more structured arrangement.
  • The website is now HTML 5 compliant. This will enable further improvements.
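
To give an idea of what the new ordering of language names means, here is a minimal sketch in Python. It only approximates the behavior described above (ignoring capitalization and diacritical marks) and is not the code the website uses. The helper `collation_key` and the sample names are made up for illustration, and proper locale-aware collation handles many more cases than this.

```python
import unicodedata


def collation_key(name):
    """Hypothetical sort key: ignore case and diacritical marks."""
    # Decompose accented letters into base letter + combining mark(s).
    decomposed = unicodedata.normalize("NFD", name)
    # Drop the combining marks so that "É" sorts next to "E".
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # casefold() makes the comparison case-insensitive; the original
    # name is kept as a tie-breaker so the order is deterministic.
    return (base.casefold(), name)


names = ["Zulu", "Éwé", "english", "Estonian", "Arabic"]  # sample data

print(sorted(names))                     # plain code-point order
print(sorted(names, key=collation_key))  # names grouped by first letter
```

With the plain sort, "english" and "Éwé" end up after "Zulu"; with the key, all the names beginning with "e" are grouped together.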

Sunday, June 22, 2014

Tatoeba update (June 22, 2014)

  • The fields for adding a new sentence are now easier to read and to type into, and clicking outside the field is ignored.
  • Cleaned up the handling of tags as follows:
    • leading and trailing spaces are now automatically trimmed
    • tag names that are too long are now safely truncated
    • tags whose names begin with "@" are not automatically deleted when the number of associated sentences drops to zero
  • Cleaned up names of existing tags as follows:
    • names of utility tags, which indicate sentences that need attention, now all start with "@" ("@check", "@delete", etc.)
    • no other tag names begin with "@"
    • rewrote tag names beginning with "By" so that they all start with "by"
    • removed extra internal spaces
    • fixed some misspelled tags
    • consolidated some other tags that had multiple variants
  • Updated the wiki page on tags.
  • Added link to donation page to top menu.
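
For the curious, the tag handling rules above can be sketched roughly as follows in Python. This is an illustration under assumptions, not the actual code of the website: the function names are hypothetical, and the 50-character limit is an arbitrary value since this post does not state the real one.

```python
MAX_TAG_LENGTH = 50  # assumed limit; the real value is not stated in the post


def normalize_tag(name, max_length=MAX_TAG_LENGTH):
    """Trim surrounding whitespace, then truncate overly long names."""
    trimmed = name.strip()
    # Slicing a Python str works on whole characters, so the cut never
    # splits a character; strip again in case the cut leaves a space.
    return trimmed[:max_length].rstrip()


def should_auto_delete(name, sentence_count):
    """Utility tags (starting with "@") are kept even with zero sentences."""
    return sentence_count == 0 and not name.startswith("@")


print(repr(normalize_tag("  by Shakespeare  ")))
print(should_auto_delete("@check", 0))
print(should_auto_delete("idiom", 0))
```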

Wednesday, June 18, 2014

Tatoeba update (June 18, 2014)

  • There is now a donation page (also read this article).
  • We moved the FAQ to the wiki. The FAQ can now be translated into other languages. If you would like to help translate the FAQ, or articles in the wiki in general, contact us. The wiki handles the following languages: Esperanto, French, German, Italian, Japanese, Mandarin Chinese, Polish, Portuguese (BR), Russian, Spanish.
  • We added new items to the user menu (under the user icon, next to the inbox icon). In addition to the existing links that allowed a user to edit his or her profile and settings, the menu now contains links to the user's sentences, comments, messages, and so on.
  • We added a form in the "Translate sentences" page to display sentences in a specific language that are not translated into another specific language. Previously, it was only possible to display random sentences.
  • Sentences that are longer than the maximum length allowed (1500 bytes) will now be shortened in a way that prevents characters from being split.

  • We cleaned up an inconsistency regarding sentences in unknown languages in the database and CSV exports. Some entries for sentences in an unknown language had marked the language as an empty string where they should have used a null.
  • Leading and trailing whitespace around sentences is now trimmed.
  • Misspelled tags starting with "@" were removed and their sentences were assigned to the correct tags.

  • Fixed some error messages.
  • Upgraded to Universal Analytics for Google Analytics.
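
Regarding the shortening of sentences that exceed 1500 bytes: the point is that a character can take several bytes, so cutting at an arbitrary byte can leave half a character at the end. Here is a minimal sketch of one way to avoid this, in Python. It assumes the text is encoded as UTF-8, and the function name `truncate_utf8` is hypothetical; the website's own implementation may well differ.

```python
MAX_SENTENCE_BYTES = 1500  # limit mentioned in the post


def truncate_utf8(text, max_bytes=MAX_SENTENCE_BYTES):
    """Shorten text to at most max_bytes of UTF-8 without splitting a character."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # Cut at the byte limit, then let the decoder drop any incomplete
    # multi-byte sequence left at the end of the cut.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


# "あい" takes 6 bytes in UTF-8 (3 per character). Cutting at 4 bytes
# would split "い", so only the first character is kept.
print(truncate_utf8("あい", max_bytes=4))
```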


Alright, we have big news: there is now a "donate" page on Tatoeba. The link is still shyly hiding in the footer of the website, because we're not yet running any sort of campaign, but it's there :)

What would we need donations for?

On various occasions, people have wondered why Tatoeba doesn't have a donate button somewhere, or why we don't ask for donations. We never really gave an official reply to this, but to make it short: it's because we never really needed it.

I wouldn't say we really need it now (the project is not going to die without donations), but it seems to me that it is not going to grow much more without a financial boost, and that is for two main things: a new server, and hiring people.

About the server

You may or may not know this, but we never really paid for hosting.

The website currently runs (since April 2010) on a server provided by the FSF France. They host us for free (on the obvious condition that we only run free software on their servers).
Before this, it was hosted for free by an acquaintance of Trang who was paying for a server but didn't use much of it, so he was fine with letting another website run on it. Later on, we gave him some money as a thank you for hosting the project for something like two years.
Even before that, it was hosted on a server from Trang's university, again for free.
And even before that, it was hosted on the free webspace provided by Sourceforge when you opened a project on their platform. I'm not sure if they still provide this service; maybe Tatoeba's prototype is still accessible somewhere, from some obscure URL.

It's clear, however, that to fulfill the growing needs of the project, we will have to move to our own dedicated server at some point in the very near future.

We haven't decided on a host yet, but it would probably cost around 60€/month, and it is something we will do with or without donations. Donations will still be welcome (since this is not a negligible cost) to cover the expenses and to ensure that Tatoeba will have enough funds to pay for a dedicated server for at least the next few years. To give you an idea, at 60€/month it would cost close to 1500€ (24 × 60€ = 1440€) to secure 2 years of hosting.

About hiring people

This is more of a long term goal.

The fact is, we currently do have 4 students working on Tatoeba-related projects as part of the Google Summer of Code program (3 months during which students code on an open source project). You could say they do it as a job more than just a hobby, since they will receive $5500 from Google for successfully completing their projects.

A few years back, sysko also did a 3-month internship within Tatoeba, as part of his senior year project. The internship was funded from part of a grant that Tatoeba received from Mozilla. Although, to be honest, I wouldn't count these 3 months as sysko working on Tatoeba as a job, because he would probably have worked just as much on Tatoeba if he had had 3 months of vacation instead.

Anyway, trying to get financial resources through grants or by participating in programs such as Google Summer of Code requires a lot of time. It can also have some limitations. Google Summer of Code takes place only during the summer, for instance. What if we just want one person to spend one day a week on maintenance tasks?

Hiring people will definitely cost more than paying for a dedicated server. It depends on the kind of skills and the kind of tasks we would need done, but I do not expect Tatoeba to raise enough money through donations to actually have people working on the project as a job anytime soon. I still wanted to mention that if we happen to receive more money than needed to pay for the server, it will be kept in the hope that someday we won't have to rely on volunteer work to get things done.

In short

Tatoeba now accepts donations (woohoo~). We will use the money donated to pay for a new server at first, and if we can raise enough, to pay for people to work on the project.

Sunday, June 8, 2014

Tatoeba update (June 8, 2014)

  • The sentence counts for tags have been corrected.
  • The sentence count for a tag is now properly decremented by 1 rather than 2 when a sentence that contains the tag is deleted.
  • When the sentence count for a tag reaches zero, the tag is automatically deleted. Note, however, that the tag will remain in the autocompletion list until the next time an admin refreshes this list.
  • All sentence lists for tags are now accessible from the list under "Browse all tags". Previously, there were a few tags that when clicked did not bring up the appropriate sentence list.
New interface language
  • Added Marathi as our 20th user interface language. (Thank you for your work, sabretou!)

Monday, May 26, 2014

Google Summer of Code projects for Tatoeba

I mentioned in a previous blog post that we have students coding for us this summer as part of the Google Summer of Code program. Coding officially started this week, on Monday the 19th. For people who are curious, here's more about the projects and the students.

Jake's project - Export to Anki deck
I intend to write a web application that will accept an Anki deck from a user and compare it against the sentence database to find sentences in which the user will know all but exactly one word. The idea is based on the Input Hypothesis. The user will be able to specify tags or certain words they want in the sentences. The application will then compile these sentences into an .apkg (Anki's file format for importing decks) and send it to the user.
More details: https://github.com/jakeprobst/iplus1-log/wiki/GSOC-2014-Proposal
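
To make the "exactly one new word" idea more concrete, here is a minimal sketch in Python. It is not taken from Jake's project: the function names, the sample data, and the naive whitespace tokenizer are all assumptions made for illustration. In particular, splitting on spaces would not work for languages such as Japanese or Chinese.

```python
def tokenize(sentence):
    """Naive tokenizer (assumption): lowercase, split on spaces, strip punctuation."""
    words = (w.strip(".,!?;:\"'()") for w in sentence.lower().split())
    return [w for w in words if w]


def sentences_with_one_new_word(sentences, known_words):
    """Return (sentence, new_word) pairs that contain exactly one unknown word."""
    known = {w.lower() for w in known_words}
    results = []
    for sentence in sentences:
        unknown = {w for w in tokenize(sentence) if w not in known}
        if len(unknown) == 1:
            results.append((sentence, unknown.pop()))
    return results


known = ["i", "like", "the", "cat", "is"]  # words assumed to be in the user's deck
corpus = [
    "I like the cat.",           # no new word
    "The cat is hungry.",        # one new word: "hungry"
    "Dogs chase the red ball.",  # several new words
]
print(sentences_with_one_new_word(corpus, known))
```

Only the second sentence is selected, since it is the only one with a single word missing from the list of known words.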

Pallav's project - Administrative scripts
As we know, the current site has gone through a couple of crashes and periods of instability, and handling such issues requires multiple administrators. To ease the task of administrators and to make recovery quick and easy, there is a need for administrative scripts that automate most of the tasks. Since the current repository lacks proper documentation for the initial setup, it is very difficult for a new user to set up a development environment. So the project's main aim is to create scripts that simplify the setup, along with a few supporting scripts that can easily perform backup, restore, export, import, etc. Also, in order to make recovery from a crash easy and smooth, an Ansible-based solution is to be created that could help the production server recover from a crash back to its latest stable version.
Along with these administrative scripts, the project extends to implementing or improving some export scripts. The existing scripts do not allow on-the-fly export of dumps, and the dumps also lack some useful information. So the aim is to overcome these shortcomings as well.
More details: https://github.com/Tatoeba/admin/raw/master/proposal.pdf

Saeb's project - A Complete Python Rewrite of Tatoeba
The current codebase is hard to maintain and work with, not to mention that it hasn't seen any major improvements in the past year or two. A rewrite of the codebase to use a higher-level language and framework will make it more maintainable, cut down on development time, and attract more developers. Also, a move towards a graph database, or the use of graph algorithms on top of a relational database, will greatly reduce server load and improve page response time. Finally, an API will greatly reduce the complexity of interacting with the website and will allow the growth of an ecosystem of external tools and programs around Tatoeba. This project aims to achieve all of the above and demonstrate the power of this design through a full-featured cross-platform JavaScript interface that should run on major desktop platforms, a number of mobile platforms, and in major browsers.
More details: https://dl.dropboxusercontent.com/u/3186185/tatoproposal-public.html
Wall thread: http://tatoeba.org/eng/wall/show_message/19536#message_19536

Harsh's project - A Mass Import System For Open Texts
A lot of open texts are freely available on the internet. For example, each language has its own folklore, which is usually open. Also, the same open texts are sometimes available in different languages, for example a Shakespeare play. I am developing a system which will take these open texts as input and churn out sentences that can be useful for the corpus. If the same text is available in a different language, it tries to pair up sentences that mean the same thing in the different languages, so that language pairs can be generated.
A responsive website will also be created, which will show these sentences and wait for a user to validate the translations or fix minor mistakes if necessary. This will save a lot of time, and can generate many sentence pairs which will act as a steroid for the corpus.
More details: https://docs.google.com/document/d/1SS-oEs8BrTY7HkSRru0b1k4J0lBRSE0MqfWAfQICzNI/edit?usp=sharing