Wednesday, December 17, 2014

Tatoeba update (December 17th, 2014)

Small update

  • We fixed the problem of the languages not being displayed on the translate page, in the list for random sentences.
  • We fixed an issue where a sentence was not part of the search results, even though it had been indexed previously. This happened when the sentence had recently been translated, or when its owner or correctness had changed.

UI translations

I'd like to mention that we now have the website interface translated 100% in 7 languages: Arabic, Esperanto, Finnish, French, German, Italian and Russian.
Marathi (97%), Japanese (92%) and Polish (90%) are also not far from being completed.

Sentences deduplication

We're delaying the sentence deduplication once more. There are still some details we'd like to fix. Even though they are not critical and the deduplication itself is working properly (as far as we know), it's better to fix them sooner rather than later.

When everything is fixed, there will be another round of deduplication on the dev website. We will again leave a few days for everyone to check that there is indeed no major issue. Then we will finally run the script on the real website.

Thank you for your patience.

Saturday, December 6, 2014

Tatoeba update (December 6th, 2014)

Bug fix for login redirection

The problem with the redirection to a random sentence's page when logging in is now fixed. When you log in, you will now stay on the page you were on.

UI translations

Gillux implemented various improvements for the UI translations. You can read more details about it in his post on the Wall.

Sentences deduplication

We're not forgetting about the deduplication script. We still have a few issues to fix before we can run it. There are no big issues, though, so we can probably start deduplicating sentences next weekend.

Friday, November 28, 2014

Tatoeba update (November 29th, 2014)


Tatoeba will be under maintenance for at least 3 hours.
Scheduled time: 04:00 to 11:00 UTC.

As mentioned in the previous blog post, we need to shut down Tatoeba for a few hours in order to make some changes in the database. These changes are needed so that we can later run the sentence deduplication script.

While Tatoeba is down, if you feel like translating, we always need people to help us translate the website interface.

Edit: The maintenance is over now. It took more time than planned due to MySQL logs still being activated during the changes, but everything went okay.

More frequent indexation

We do not have a lot of new things for this update, but there is nonetheless one piece of good news. Thanks to some optimization that gillux did a couple of weeks ago, we can afford to index new sentences more often. The interval was previously set to 1 hour, and we reduced it to 15 minutes. In other words, you will never have to wait more than 15 minutes to be able to find your sentences via the search function.

Donations and thanks

We recently received a small donation from Stanislav. So thank you, Stanislav! And thanks again to everyone before him who donated to make sure that Tatoeba will be hosted on a stable and fast server for the next few years :)

Sunday, November 23, 2014

Tatoeba update (November 23rd, 2014)


  • Regarding the link feature for advanced contributors: it is now possible to drag-and-drop the icons (instead of the sentence text) into the link icon in the menu.
  • Our asset files (images, CSS, JavaScript) now have a timestamp, so that the browser knows whether or not it needs to update them. This means you should no longer have to worry about clearing your browser's cache.
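
A common way to implement this kind of timestamping is to append the file's modification time to its URL. The Python sketch below only illustrates the idea: `asset_url` and the example paths are hypothetical, and the post does not say how Tatoeba actually generates its timestamps.

```python
import os


def asset_url(public_path: str, file_path: str) -> str:
    """Return the public URL of an asset with its modification time appended.

    When the file changes, its mtime changes, so the URL changes and the
    browser fetches the new version instead of reusing its cached copy.
    """
    timestamp = int(os.path.getmtime(file_path))
    return f"{public_path}?{timestamp}"
```

For a stylesheet last modified at Unix time 1416700000, `asset_url("/css/main.css", path)` would give `/css/main.css?1416700000`.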

Development website

Gillux recently set up a development (dev) website. The purpose of the dev website is to let members test new features and check the interface translations BEFORE they get released to the production (prod) website, that is, the actual Tatoeba website.


We are planning to disable Tatoeba temporarily next weekend (November 29) for maintenance.
The maintenance consists of changing the engine of our MySQL database from MyISAM to InnoDB. For this operation we need to stop access to the database, which is why we need to shut down Tatoeba. It should take around 3 hours.
We need to do this change in order to run the sentences deduplication script. More about this below.
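
For readers curious about what such a change looks like: in MySQL, a table's storage engine is switched with an `ALTER TABLE ... ENGINE=InnoDB` statement. The sketch below merely builds those statements in Python. The table names are hypothetical, and this is not the actual migration procedure that was used.

```python
def engine_conversion_statements(tables, engine="InnoDB"):
    """Build one ALTER TABLE statement per table name."""
    return [f"ALTER TABLE `{name}` ENGINE={engine};" for name in tables]


# Hypothetical table names, used for illustration only.
for statement in engine_conversion_statements(["sentences", "users"]):
    print(statement)
```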

Sentences deduplication

First of all, note that the deduplication script will not be running during the maintenance, but after it. The script can run while Tatoeba is available. It is not yet certain whether we will run the script next weekend or later. We are still in the phase of debugging the script.

There was a first test of the script on the dev website. It took 9.5 hours to complete. You can help us make sure that the script works well by checking the dev website. Duplicates that were removed can be identified because they were deleted by Horus (the current name of the deduplication bot).
If you notice any issue, such as sentences that were deleted when they shouldn't have been, or information that was not re-linked properly, report the problem to us on the Wall of the real website (not on the dev one, please) or on our Google group.
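
To make "deleted" and "re-linked" more concrete, here is a minimal Python sketch of the general idea. It is not the real script: it assumes that duplicates are sentences with the same language and identical text, that the lowest id of each group is kept, and that only translation links need to be re-pointed. All names are hypothetical.

```python
from collections import defaultdict


def deduplicate(sentences, links):
    """Merge sentences that share the same language and text.

    sentences: dict mapping id -> (lang, text)
    links: set of (id_a, id_b) pairs of linked sentences
    Returns (kept_sentences, relinked_links, removed_ids).
    """
    groups = defaultdict(list)
    for sid, key in sentences.items():
        groups[key].append(sid)

    # Map every sentence to the lowest id in its group (an assumption).
    canonical = {}
    for ids in groups.values():
        keep = min(ids)
        for sid in ids:
            canonical[sid] = keep

    kept = {sid: s for sid, s in sentences.items() if canonical[sid] == sid}
    relinked = set()
    for a, b in links:
        a, b = canonical[a], canonical[b]
        if a != b:  # drop links that would point a sentence at itself
            relinked.add((a, b))
    removed = sorted(set(sentences) - set(kept))
    return kept, relinked, removed
```

In this model, a translation that was linked to a removed duplicate ends up linked to the sentence that was kept, which is the kind of re-linking worth checking on the dev website.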

Sunday, November 16, 2014

Tatoeba update (November 16th, 2014)

Link to any sentence

This new feature affects only advanced contributors and corpus maintainers. It is now possible to link a sentence to any other sentence, and not just to its indirect translations. You will find an additional icon in the sentence's menu, which looks the same as the "link" icon next to the translations. Clicking on the button opens a text input where you can indicate the target sentence.
You can enter either the sentence number or copy-paste the sentence URL.
You can also drag-and-drop a sentence's URL into the icon.

Linking and unlinking refreshes all the translations

There were some inconsistencies with the list of indirect translations displayed after linking or unlinking a sentence. This is now fixed. Whenever you link or unlink, you will see the correct list of indirect translations without having to refresh the page.

Contributions logs

The design of the logs has been revised to take into account the feedback we received. If you do not see any change, try to empty your cache and/or refresh again.
Note that the date is now clickable and will redirect you to the sentence's page. The sentence will be left as plain text so that people can copy-paste it - or part of it - more easily.

We won't implement any option to choose between the new and old design, but for those who are very attached to the old design, here's some CSS code that you can use with the Stylish extension.
Our member CK also has a page about using Stylish with Tatoeba, with some code snippets that you can reuse.
I encourage you to learn about CSS and customize the looks to your own taste, not only for the logs but for any other part of Tatoeba. And if you do come up with something that looks a lot better, don't hesitate to share with the rest of the community!

Search fix for sentences translated into the same language

If you have ever tried to search from and into the same language (for instance, searching "fish" from English to English), you may have noticed that the results include many sentences that do not have any translation - in case you were wondering, yes, it is possible in Tatoeba to have two sentences of the same language linked to each other.
This kind of search now only returns sentences that do have translations. So searching "fish" from English to English will only return sentences that have at least one direct or indirect translation in English.
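
The filter can be pictured as follows. This Python sketch assumes that a direct translation is a linked sentence and that an indirect translation is a sentence linked to a direct translation. The data structures and the function name are hypothetical, and this is not the actual search query.

```python
def has_translation_in(sentence_id, target_lang, lang_of, links):
    """True if the sentence has a direct or indirect translation in target_lang.

    lang_of: dict mapping sentence id -> language code
    links: dict mapping sentence id -> set of directly linked sentence ids
    """
    direct = links.get(sentence_id, set())
    indirect = set()
    for other in direct:
        indirect |= links.get(other, set())
    indirect -= direct | {sentence_id}
    return any(lang_of[s] == target_lang for s in direct | indirect)
```

With this check, an English sentence linked to a French sentence that is itself linked to another English sentence passes the English-to-English filter, while an English sentence with no links at all does not.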

Fixed message not submitted after changing UI language

This update also fixes an annoying bug that prevented people from sending comments, wall posts, translations, private messages etc. whenever the interface language was changed from a different place than the page you were submitting from. The symptom was a never-ending loading icon that replaced the text you wanted to submit, while nothing was actually submitted.

Tuesday, November 11, 2014

Tatoeba update (November 11th, 2014)

Contributions logs

  • The contributions logs have been redesigned. 
  • There is a small additional visual feature: log entries that are obsolete are displayed a bit differently (with a dotted line and grey text), to indicate that there were further modifications to the sentence afterwards.
  • The latest contributions page now also includes the list of users who participated in the latest contributions. It is the same list that you would find in the Members page.

New platform for UI translations

We moved to a platform called Transifex to manage our interface translations. Hopefully this will help us build a more cohesive translators team.
For those who were previously translating on Launchpad: we do not use Launchpad anymore. Don't worry, the translations that were made in Launchpad were exported to Transifex, so no translation was lost.

If you would like to join the translators team, simply go to this page, click on "Help translate Tatoeba website", create an account, log in and apply to the language(s) in which you'd like to translate. If the language is not listed, you can request it to be added. Once your application is validated, you will be able to submit translations.

Sunday, October 19, 2014

Tatoeba update (October 19, 2014)

Search results sorted by sentence length

Shorter sentences will have higher priority over longer ones in the search results. Even though the length of a sentence does not necessarily imply that it's a better example sentence, this should make the results more relevant overall.

Possibility to comment on deleted sentences

The comment form was displayed on deleted sentences, but the comment was not saved after submission. This has been fixed and it is now possible to post comments on deleted sentences.

Script to remove duplicate sentences

This is just a little note that there has been good progress on the deduplication script. We'll hopefully be able to clean up the corpus soon :)

Other fixes

  • Fixed truncation of long URLs containing non-Latin characters.
  • Long words or links that exceed their container box are now wrapped onto new lines instead.
  • Fixed a bug where part of a URL would be converted into a sentence's link.
  • Fixed a bug where some Wall message previews were displayed as empty on the homepage.

Saturday, October 4, 2014

Tatoeba update (October 4th, 2014)

Sphinx 2.1.9

We have upgraded the search engine to Sphinx 2.1.9. This fixes an issue where searching the word "why" would return no result, despite the fact that many sentences in the database use this word.

New sentences quickly available to search

You will no longer have to wait weeks before you can find, through the search, a sentence that you have added. We know that many people have been wondering why they cannot find a sentence that they have recently added. The reason, in short, is that sentences need to be indexed before you can find them through the search. We couldn't index too often, because it would take too long and use too many resources.
But with the new server, and with gillux's work on implementing a "delta index", we can now provide search results that are much more up-to-date. New sentences will appear in the search results within an hour or less.
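
For those unfamiliar with the term, the idea behind a delta index is to keep a large main index that is rebuilt rarely, plus a small index of recent additions that is cheap to rebuild often, and to query both. The toy Python model below only illustrates that principle. It is not Sphinx configuration, it only tracks new sentences by id, and every name in it is hypothetical.

```python
class DeltaIndexedSearch:
    """Toy model of a main index plus a small delta index."""

    def __init__(self):
        self.main = {}   # id -> text, rebuilt rarely (expensive)
        self.delta = {}  # id -> text, rebuilt often (cheap)
        self.max_main_id = 0

    def rebuild_main(self, sentences):
        """Index everything and remember the highest id covered."""
        self.main = dict(sentences)
        self.delta = {}
        self.max_main_id = max(sentences, default=0)

    def rebuild_delta(self, sentences):
        """Index only the sentences added since the last main rebuild."""
        self.delta = {i: t for i, t in sentences.items() if i > self.max_main_id}

    def search(self, word):
        """Look the word up in both indexes and merge the results."""
        combined = {**self.main, **self.delta}
        needle = word.lower()
        return sorted(i for i, t in combined.items() if needle in t.lower().split())
```

A sentence added after the last main rebuild becomes searchable as soon as the small delta index is rebuilt, without reindexing the whole corpus.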

Sentences of a user visible to everyone

We have fixed a bug where the page listing the sentences of a specific user was only accessible to logged-in users. The page is now visible to any user, logged in or not.

Saturday, September 27, 2014

Tatoeba update (September 27th, 2014)

Improvement of the search feature
  • The priority in which sentences are displayed in a search result has been improved: sentences with an owner will be displayed before sentences without any owner, and unapproved sentences (whether they have an owner or not) will be displayed last.
  • Uppercase letters with diacritics are now properly matched with their lowercase versions in a search. The problem was that searching for, say, "ça va" would not return sentences containing "Ça va" (with the ç in uppercase). This should now be fixed.
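
Both points can be illustrated with a few lines of Python. This is only a sketch of the described behaviour, not the search engine's actual configuration. The dictionary fields `owner` and `approved` are hypothetical.

```python
def rank_key(sentence):
    """Sort key: owned sentences first, then orphans, then unapproved ones."""
    if not sentence["approved"]:
        return 2
    return 0 if sentence["owner"] else 1


def matches(query, text):
    """Case-insensitive containment that also handles letters such as 'Ç'."""
    return query.casefold() in text.casefold()
```

Sorting search results with `sorted(results, key=rank_key)` puts unapproved sentences last whether or not they have an owner, and `matches("ça va", "Ça va ?")` is true because both strings are case-folded before comparison.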

Furigana and romaji
  • Furigana for Japanese sentences is now displayed properly (this was fixed last week). It was previously displayed in katakana and on all the words. It is now displayed in hiragana, and only on words containing kanji.
  • The tool to convert Japanese text into romaji now displays romaji properly. It was previously displaying the output in katakana instead of Latin letters.

Other fixes and changes
  • The random feature has been fixed for the following languages: Amharic, Cherokee, Lao, Mon, Sinhala, Tamil, Telugu, Tibetan.
  • References to a sentence number are now converted properly into a link if they are the first word of the message.
  • The "translate" button has been disabled on unapproved sentences.
  • An option was added in the settings to choose whether or not to remember the last list selected. By default it is disabled.

Friday, September 12, 2014

Tatoeba update (September 12th, 2014)

What's new
  • New look for the comments form. Clear your cache or refresh the page again if your form looks strange.
  • New data available in export files: sentences with audio.
  • The anchor in links to comments is back, so that you jump directly to the right comment when you click on the "#" link.
  • The confirmation popup when deleting a comment is back as well.
  • The text for downloading lists has been reviewed to be clearer.
  • The text on the page "Improve sentences" was made translatable. It will take a bit more time until the strings appear in Launchpad for translation.

Wednesday, September 3, 2014

Tatoeba update (September 3rd, 2014)

New design for messages
  • The main change in this update is the new design for comments on sentences, messages on the Wall, and private messages.
  • You will now also be able to use the sentence URL syntax everywhere, and not only for comments on sentences. The syntax doesn't require brackets anymore (you can simply type #123 instead of [#123]). If you were not aware of this feature: for instance typing #123 in your message will be displayed as a link to the corresponding sentence.
  • The button to send a private message to the author of a message is now present on all messages, and not just on comments on sentences.
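
As an illustration of the #123 syntax, here is a small Python sketch of how such references could be turned into links. The base URL is a placeholder and the matching rule (a reference must start the message or follow whitespace) is an assumption, not the website's actual implementation.

```python
import re

# Placeholder URL pattern; the real sentence URL scheme is not given in the post.
SENTENCE_URL = "https://example.org/sentences/{id}"

# "#" followed by digits, only at the start of the text or after whitespace,
# so that fragments inside URLs (e.g. "page#123") are left alone.
SENTENCE_REF = re.compile(r"(?<!\S)#(\d+)\b")


def linkify(message: str) -> str:
    """Replace each #123 reference with an HTML link to that sentence."""
    def to_link(match):
        url = SENTENCE_URL.format(id=match.group(1))
        return f'<a href="{url}">#{match.group(1)}</a>'
    return SENTENCE_REF.sub(to_link, message)
```

With this rule, "#123 is a duplicate" gets a link on its first word, while a URL that ends in "#123" is left untouched.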
Other fixes
  • The tags count is updated properly. It was previously not incremented/decremented when adding a tag to a sentence.
  • The sentences count is updated properly. Same issue as with tags: the count was not incremented/decremented properly when adding a sentence.

Sunday, August 24, 2014

Tatoeba update (August 24, 2014)

Upgrade to CakePHP 1.3 and other small fixes
  • We have upgraded to CakePHP 1.3. This will not have any visible impact for users. The next step would be to upgrade to version 2.x, but there is no clear plan for it at the moment.
  • We have changed the links to the wiki to make it less confusing for non-English users. The links currently point to the English version only, because too much of the wiki content is untranslated. When the links were not forced to English, non-English users would be redirected to the non-English version of the page and find an empty page or a page requesting them to log in, because the page was not translated or didn't exist.
  • We have fixed various graphical bugs.

Donations news

We recently received a very large donation of 1939€. The donor wished to remain anonymous. We also received another non-negligible 100€. Thank you, Ray!

In total we have now received 2334€ in donations, which means Tatoeba can afford to pay for a dedicated server for the next few years. We are therefore much less likely to run into the slowness and unavailability issues that you may have experienced in the past.
With these donations we can also start considering using platforms such as Bountysource to get things done faster.

Saturday, August 16, 2014

Tatoeba update (August 16, 2014)

New download URLs

We have changed the URL of the download files, which contain the data that we redistribute. The files are also now compressed. The old URL is still available for the time being, but will no longer contain the latest data.

International targeting

We've included the necessary HTML tags for Google to display the results in the relevant language, and not systematically in English.
On a related topic, we still (and will always) need people to help us translate Tatoeba's interface into other languages. If you would like to help, check out the instructions here.

Donation news

We'd like to thank our two latest donors, Dmitriy and Aleksandr. We've had 8 donations so far, which amount to a total of 295€. The top donation was 100€.

Friday, August 8, 2014

Tatoeba update (August 8, 2014)

Better language selector

We know that the list of languages has gotten pretty long, and it can be impractical to select a language, so we're introducing a better language selector. The new selector has a search field, and you will be able to filter the list to show only languages that match the characters that you have entered.
This feature is only available to registered members at the moment. It requires you to activate it from your settings (Options > Advanced language selector). We did not want to make it globally available (not yet, at least), since we know that it may not work on tablets. If you have a tablet, please let us know if/how it works for you.
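
The filtering itself is simple. The sketch below shows one possible rule in Python, assuming that a language matches when its name contains the typed characters, ignoring case. The real selector runs in the browser and may well match differently.

```python
def filter_languages(languages, typed):
    """Return the language names that contain the typed characters."""
    needle = typed.casefold()
    return [name for name in languages if needle in name.casefold()]
```

Typing "man" with this rule would keep both "German" and "Mandarin Chinese" in the list.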

Other bug fixes
  • We fixed a bug where the logs did not record the user who added a sentence/translation.
  • We fixed the pagination for the "contributions" page when a language is specified.

We hope that you've been enjoying hanging out on Tatoeba now that it's hosted on a new server :)  We have received 3 donations since the migration, so I'd like to thank (again) William, Gary and Shayne for their donations.

Saturday, August 2, 2014

Tatoeba update (August 2, 2014)

Tatoeba is finally back! Again, we are really sorry for the inconvenience that the unavailability of the website may have caused you. Our previous server was getting way too unstable and we had no choice but to move to a new one.

With this migration, we have included a few fixes and improvements:
  • The interface now remembers the most recent list to which you assigned a sentence and sets that as the default when you want to assign another sentence to a list.
  • Within sentences, runs of multiple spaces, as well as tabs and line breaks, are now condensed into single spaces. This makes the database contents consistent with the display. Non-breaking spaces are unaffected.
  • We now print the search results string correctly for words in languages that are written from right to left.
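
The whitespace rule above can be expressed with a single regular expression. The Python sketch below is an illustration under the stated rule, not the code that runs on the website. Note that it lists the characters explicitly, because Python's `\s` would also match non-breaking spaces.

```python
import re

# Spaces, tabs and line breaks only. "\s" is avoided on purpose because in
# Python it also matches the non-breaking space (U+00A0).
_RUN = re.compile(r"[ \t\r\n]+")


def condense_whitespace(text: str) -> str:
    """Collapse each run of spaces, tabs and line breaks into one space."""
    return _RUN.sub(" ", text)
```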
I'd like to take this occasion to mention donations again. I know that several people have expressed the desire to support Tatoeba financially, so if you are one of them, you will find the necessary information here.
I'd also like to mention that this migration would never have been possible without gillux and saeb. Moving Tatoeba to a new server was not an easy task at all. So if you want to thank anyone for restoring the website, you should thank them. The project sure needs more people like them :)

Monday, July 28, 2014

Tatoeba will soon be moving to a new server

We apologize for the difficulty you may have experienced recently when trying to reach Tatoeba. We are about to move to a new server in order to address these problems permanently. The migration should be finished within the next two weeks. Thank you for your patience.

Update 1 (2014-07-30)
We have ordered a new server and it was delivered this morning. Gillux and saeb are currently working on it. We can't say for sure yet when Tatoeba will be back up, but we're progressing.

Update 2 (2014-07-31)
We are making progress, but yesterday afternoon we lost connection to our new server while trying to configure it. We have opened a ticket and are now waiting for our host to investigate the issue.

Update 3 (2014-07-31)
We have access again. At the moment, we're still configuring the server.

Update 4 (2014-08-02)
We're done installing everything. Huge thanks to gillux and saeb! We're testing things a bit to make sure that everything works properly, but Tatoeba should be back some time today :)

Sunday, July 20, 2014

Tatoeba update (July 20, 2014)

Small layout change
  • The logo and search bar have been redesigned a little bit. The logo does not have the "" text in it anymore, so there is now more space for the content.
  • The search bar now has an adaptable height. This probably has no visible impact for most people, but for those who use certain fonts, the search bar would look a bit broken due to the greater height of these fonts. That is now fixed.

Clicking on a log entry highlights the corresponding translation
  • On the sentence's page (for instance, you can now click on the logs to highlight the corresponding translation. This makes it easier to identify the text of the translation, since the logs only mention the translation's ID.
Optimization of "latest contributions" in a specific language
  • You may have experienced a "Gateway" error when trying to view the latest contributions in a specific language (for instance The query has been optimized to avoid this error.

That's it for this update.

On a side note, we are aware that Tatoeba has been pretty slow these days. If you are interested in helping us optimize the website, you are always welcome.

Sunday, June 29, 2014

Tatoeba update (June 29, 2014)

  • We now remember the settings you last chose for translating sentences (source language, target language, languages in which to show translations, whether to show only sentences with audio). Previously, we always chose the language in which the user interface was shown as the source language, and we used default settings for the other values.
  • The first two items in the "Not directly translated into" list have been renamed from "None" and "All languages" to "—" and "Any languages", respectively.
  • Language names are now displayed in a more logical order (collation). In languages whose writing system is alphabetic, the new order will group together language names beginning with the same letter, whether or not they are capitalized and whether or not they have a diacritical mark (accent). For other languages, the new order is also more logical.
  • Lists are now identified as "My lists" (editable only by the contributor who is logged in), "Collaborative lists" (editable by anyone), and "Other personal lists" (editable only by a particular contributor who is not the one logged in). This expresses their nature better than "private" and "public", which implied that only "public" lists could be read by others, which was never the case.
  • We fixed a bug where the date was displayed as "Nov 30th -0001, 00:00" when it should be displayed as "date unknown".
  • We now write the correct values for languages to the appropriate table when sentences and translations are imported from a file.
  • We cleaned up the codebase, getting rid of unused files and putting scripts in a more structured arrangement.
  • The website is now HTML 5 compliant. This will enable further improvements.
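
To give an idea of what the new ordering of language names means in practice, here is a simplified Python sketch of a sort key that ignores capitalization and diacritical marks. Real collation rules are language-specific and more involved, so treat this as a rough approximation with a hypothetical `collation_key` helper.

```python
import unicodedata


def collation_key(name: str) -> str:
    """Key that ignores case and diacritics when sorting names."""
    decomposed = unicodedata.normalize("NFD", name)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.casefold()


names = ["espéranto", "Anglais", "Écossais", "allemand"]
print(sorted(names, key=collation_key))
```

With a plain sort, "Écossais" would land after every unaccented name. With this key it is grouped with the other names starting with "e".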

Sunday, June 22, 2014

Tatoeba update (June 22, 2014)

  • The fields for adding a new sentence are now easier to read and to type into, and clicking outside the field is ignored.
  • Cleaned up the handling of tags as follows:
    • leading and trailing spaces are now automatically trimmed
    • tag names that are too long are now safely truncated
    • tags whose names begin with "@" are not automatically deleted when the number of associated sentences drops to zero
  • Cleaned up names of existing tags as follows:
    • names of utility tags, which indicate sentences that need attention, now all start with "@" ("@check", "@delete", etc.)
    • no other tag names begin with "@"
    • rewrote tag names beginning with "By" so that they all start with "by"
    • removed extra internal spaces
    • fixed some misspelled tags
    • consolidated some other tags that had multiple variants
  • Updated the wiki page on tags.
  • Added link to donation page to top menu.
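
The tag handling rules listed above could look roughly like this in Python. The maximum length is a made-up value, truncation is done per character for simplicity, and both function names are hypothetical.

```python
MAX_TAG_LENGTH = 50  # hypothetical limit; the post does not give the real one


def normalize_tag_name(raw: str) -> str:
    """Trim the name, collapse internal spaces and truncate it if too long."""
    name = " ".join(raw.split())
    return name[:MAX_TAG_LENGTH].rstrip()


def should_auto_delete(tag_name: str, sentence_count: int) -> bool:
    """Empty tags are deleted unless they are utility tags starting with '@'."""
    return sentence_count == 0 and not tag_name.startswith("@")
```

For example, `normalize_tag_name("  by   Shakespeare ")` gives "by Shakespeare", and an empty "@check" tag is kept while an empty ordinary tag is removed.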

Wednesday, June 18, 2014

Tatoeba update (June 18th, 2014)

  • There is now a donation page (also read this article).
  • We moved the FAQ to the wiki. The FAQ can now be translated into other languages. If you would like to help translate the FAQ or articles in the wiki in general, contact us. The wiki handles the following languages: Esperanto, French, German, Italian, Japanese, Mandarin Chinese, Polish, Portuguese (BR), Russian, Spanish.
  • We added new items to the user menu (under the user icon, next to the inbox icon). In addition to the existing links that allowed a user to edit his or her profile and settings, the menu now contains links to the user's sentences, comments, messages, and so on.
  • We added a form in the "Translate sentences" page to display sentences in a specific language that are not translated into another specific language. Previously, the page could only display random sentences.
  • Sentences that are longer than the maximum length allowed (1500 bytes) will now be shortened in a way that prevents characters from being split.

  • We cleaned up an inconsistency regarding sentences in unknown languages in the database and CSV exports. Some entries for sentences in an unknown language had marked the language as an empty string where they should have used a null.
  • Leading and trailing whitespace around sentences are now trimmed.
  • Misspelled tags starting with "@" were removed and their sentences were assigned to the correct tags.

  • Fixed some error messages.
  • Upgrade to Universal Analytics for Google Analytics.
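
Shortening a sentence to a byte limit without splitting a character deserves a short illustration. The Python sketch below assumes the text is stored as UTF-8, which the post does not state, and it is not the code used on the website.

```python
MAX_BYTES = 1500  # maximum sentence length mentioned in the post


def truncate_utf8(text: str, max_bytes: int = MAX_BYTES) -> str:
    """Shorten text to at most max_bytes of UTF-8 without splitting a character."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # Cutting the byte string may leave an incomplete character at the end;
    # errors="ignore" drops those trailing bytes when decoding.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")
```

A naive cut at exactly 1500 bytes could end in the middle of a multi-byte character, such as an accented letter or a kanji, and leave an invalid character at the end of the sentence. This version backs off to the last complete character instead.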


Alright, we have big news: there is now a "donate" page on Tatoeba. The link is still shyly hiding in the footer of the website, because we're not yet running any sort of campaign, but it's there :)

What would we need donations for?

On various occasions people have wondered why Tatoeba doesn't have a donate button somewhere, or why we don't ask for donations. We never really gave an official reply to this, but to make it short: it's because we never really needed it.

I wouldn't say we really need it now (the project is not going to die without donations), but it seems to me that it is not going to grow much more without a financial boost, and that is for two main things: a new server, and hiring people.

About the server

You may or may not know this, but we never really paid for hosting.

The website currently runs (since April 2010) on a server provided by the FSF France. They host us for free (on the obvious condition that we only run free software on their servers).
Before this, it was hosted for free by an acquaintance of Trang who was paying for a server but didn't use much of it, so he was fine with letting another website run on it. Later on we gave him some money as a thank-you for hosting the project for something like two years.
Even before, it was hosted on a server from Trang's university, again for free.
And even before, it was hosted on the free webspace provided by Sourceforge when you opened a project on their platform. I'm not sure if they still provide this service, maybe Tatoeba's prototype is still accessible somewhere, from some obscure URL.

It's clear, however, that to fulfill the growing needs of the project, we will have to move to our own dedicated server at some point in the very near future.

We haven't decided on a host yet, but it would probably cost something around 60€/month, and it is something we will do with or without donations. Donations will still be welcome (since this is not a negligible cost) to cover the expenses and to ensure that Tatoeba will have enough funds to pay for a dedicated server for at least the next few years. To give you an idea, based on 60€/month, it would cost close to 1500€ to secure 2 years of hosting.
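
For those who like to see the arithmetic spelled out, here it is as a tiny Python calculation, using only the estimate of 60€ per month given above.

```python
MONTHLY_COST_EUR = 60  # estimate given in the post
MONTHS = 2 * 12        # two years

total = MONTHLY_COST_EUR * MONTHS
print(f"{total}€ for {MONTHS} months of hosting")  # 1440€ for 24 months of hosting
```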

About hiring people

This is more of a long term goal.

The fact is, we do currently have 4 students working on Tatoeba-related projects as part of the Google Summer of Code program (3 months during which students code on an open source project). You could say they do it as a job more than just a hobby, since they will receive $5500 from Google for successfully completing their projects.

A few years back, sysko also did a 3-month internship within Tatoeba, as part of his senior year project. The internship was funded from part of a grant that Tatoeba received from Mozilla. Although, to be honest, I wouldn't count these 3 months as sysko working on Tatoeba as a job, because he would probably have worked just as much on Tatoeba if he had had 3 months of vacation instead.

Anyway, trying to get financial resources through grants or by participating in programs such as Google Summer of Code requires a lot of time. It can also have some limitations. Google Summer of Code takes place only during the summer, for instance. What if we just want one person to spend one day a week on maintenance tasks?

Hiring people will definitely cost more than paying for a dedicated server. It depends on the kind of skills and the kind of tasks we would need to be done, but I do not expect Tatoeba to raise enough money through donations to actually have people working on the project as a job anytime soon. I still wanted to mention that if we happen to receive more money than needed to pay for the server, it will be kept in the hope that someday we won't have to rely on volunteer work to get things done.

In short

Tatoeba now accepts donations (woohoo~). We will use the money donated to pay for a new server at first, and if we can raise enough, to pay for people to work on the project.

Sunday, June 8, 2014

Tatoeba update (June 8, 2014)

  • The sentence counts for tags have been corrected.
  • The sentence count for a tag is now properly decremented by 1 rather than 2 when a sentence that contains the tag is deleted.
  • When the sentence count for a tag reaches zero, the tag is automatically deleted. Note, however, that the tag will remain in the autocompletion list until the next time an admin refreshes this list.
  • All sentence lists for tags are now accessible from the list under "Browse all tags". Previously, there were a few tags that when clicked did not bring up the appropriate sentence list.
New interface language
  • Added Marathi as our 20th user interface language. (Thank you for your work, sabretou!)

Monday, May 26, 2014

Google Summer of Code projects for Tatoeba

I mentioned in a previous blog post that we have students coding for us this summer as part of the Google Summer of Code program. The coding officially started this week, on Monday the 19th. For people who are curious, here's more about the projects and the students.

Jake's project - Export to Anki deck

I intend to write a web application that will accept an Anki deck from a user and compare it against the sentence database to find sentences where the user will know all but exactly one word. The idea is based on the Input Hypothesis. The user will be able to specify tags or certain words they want in the sentences. Then it will compile these sentences into an .apkg (Anki's file format for importing decks) and send it to the user.
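
The core selection step of such a tool can be sketched in a few lines of Python. This is only an illustration of the "exactly one new word" idea, not Jake's code: the tokenizer is deliberately naive (it would not work for languages written without spaces) and the function names are hypothetical.

```python
import re


def words(sentence: str) -> set:
    """Very rough tokenizer: lowercase runs of word characters."""
    return set(re.findall(r"\w+", sentence.lower()))


def one_new_word_sentences(sentences, known_words):
    """Return (sentence, new_word) pairs with exactly one unknown word."""
    known = {w.lower() for w in known_words}
    result = []
    for sentence in sentences:
        unknown = words(sentence) - known
        if len(unknown) == 1:
            result.append((sentence, next(iter(unknown))))
    return result
```

Given the known words "i", "like" and "you", the sentence "I like fish." would be selected with "fish" as the new word, while "I like you." (nothing new) and "I like green fish soup." (three new words) would be skipped.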
More details:

Pallav's project - Administrative scripts

As we know, the current site has gone through a couple of crashes and periods of instability, and handling such issues requires multiple administrators. To ease the task of administrators and to make recovery quick and easy, there is a need for administrative scripts that automate most of the tasks. Since the current repository lacks proper documentation for the initial setup, it is very difficult for a new user to set up a development environment. So the project's main aim is to create scripts that simplify the setup task, along with a few supporting scripts that can easily perform backup, restore, export, import, etc. Also, in order to make recovery from a crash easy and smooth, an Ansible-based solution is to be created that could help the production server recover from a crash back to its latest stable version.
Along with these administrative scripts, the project extends to implementing/improving some export scripts. The existing scripts do not allow on-the-fly export of dumps, and the dumps also lack some useful information. So the aim is to overcome these shortcomings as well.
More details:

Saeb's project - A Complete Python Rewrite of Tatoeba

The current codebase is hard to maintain and work with, not to mention that it hasn't seen any major improvements in the past year or two. A rewrite of the codebase to use a higher-level language and framework will make it more maintainable, cut down on development time, and attract more developers. Also, a move towards a graph database or the use of graph algorithms on top of a relational database will greatly reduce server load and improve page response time. Finally, an API will greatly reduce the complexity of interacting with the website and will allow the growth of an ecosystem of external tools and programs around Tatoeba. This project aims to achieve all of the above and demonstrate the power of this design through a full-featured cross-platform JavaScript interface that should run on major desktop platforms, a number of mobile platforms, and in major browsers.
More details:
Wall thread:

Harsh's project - A Mass Import System For Open Texts

A lot of open texts are freely available on the internet. For example, each language has its own folklore, which is usually open. Sometimes the same open text is also available in different languages, for example a Shakespeare play. I am developing a system that will take these open texts as input and churn out sentences that can be useful for the corpus. If the same text is available in a different language, the system tries to pair up sentences that mean the same thing in the different languages, so that language pairs can be generated.
A responsive website will also be created which will show these sentences and wait for some user to validate the translations or edit minor mistakes if necessary. This will save a lot of time, and can generate many sentence pairs which will act as a steroid for the corpus.
More details:

Saturday, May 17, 2014

Tatoeba update (May 17, 2014): unapproved sentences and other changes

With this update, we are introducing functionality to help us manage sentences contributed by users who ignore repeated warnings that their sentences violate copyright laws or that their sentences are faulty. This is intended to be a means of dealing with offenders rather than a comprehensive means of ranking users or their sentences. We need this measure to protect those who use and redistribute our sentences. (See .)

Admins (and eventually corpus maintainers) will be able to mark sentences as untrustworthy, after which they will be displayed in red and excluded from downloads. Admins will also be able to mark users as untrustworthy, which will cause new sentences contributed by those users to be marked as untrustworthy as well. In this first iteration, there is no means of indicating via the user interface that a particular user is untrustworthy in a particular language but can be trusted in other languages.

In general, if you see a copyrighted or faulty sentence, you should leave a comment on it. Corpus maintainers can delete these sentences after a warning period. However, if you see that a user is posting substantial numbers of such sentences and does not respond to either comments or private messages, inform an admin.

We are introducing other functionality as well:

(1) We now allow lists of up to 100 sentences to be downloaded. When an attempt is made to download a longer list, a message is displayed that states this maximum, as well as the length of the given list. If you want your lists to be downloadable, we encourage you to keep them to 100 sentences or fewer.

(2) We now show up to seven numbered page boxes at the top of a list, rather than the five that we previously displayed. This makes better use of the space we already have available.

(3) When a login attempt fails, we show a message rather than simply redisplaying the page.
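
As an illustration of item (1), the check could look something like the following Python sketch. This is not the site's actual code (Tatoeba runs on CakePHP); the function name and message wording are hypothetical, and only the limit of 100 comes from this post.

```python
MAX_DOWNLOADABLE_SENTENCES = 100  # the maximum stated in this post


def check_list_download(list_length, max_length=MAX_DOWNLOADABLE_SENTENCES):
    """Return (allowed, message) for an attempt to download a list.

    When the list is too long, the message states both the maximum and
    the length of the given list, as described above.
    """
    if list_length <= max_length:
        return True, ""
    message = (
        f"Only lists of up to {max_length} sentences can be downloaded. "
        f"This list contains {list_length} sentences."
    )
    return False, message


print(check_list_download(42))
print(check_list_download(250))
```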

Finally, we have added Gujarati to our list of languages.

Saturday, May 10, 2014

New feature: unapproved sentences

We are soon going to release a new feature, and I would like to take some time to talk about it. First of all, here's what this feature will do:
  • Corpus maintainers will be able to mark a sentence as "unapproved".
  • Admins will be able to change the "level" of a contributor. By default contributors have a level of 0, but admins can set this level to -1 so that any new sentence or translation from these contributors is marked as "unapproved".
  • Unapproved sentences will still be in the database and will still be indexed whenever we run the indexation, but will be displayed in red on the website.
  • Unapproved sentences will however NOT be exported into the CSV file that we distribute.
The goal of this feature is to deal with 2 issues:
  1. Bad quality sentences. We want Tatoeba to become more useful for language learners. The problem is that since everyone can contribute sentences and translations, some contributions are not reliable enough for language learning, but maybe not bad enough that it's clear they should be deleted.
  2. Non CC-BY sentences. It often happens that new contributors copy-paste sentences from other language learning sources. This is a problem because Tatoeba redistributes the sentences under the CC-BY license and the content needs to be CC-BY compliant.
Setting those sentences as "unapproved" allows us to warn users that there is an issue with the sentence and that they should use it with extra care. This feature will also allow admins to act more quickly when a contributor is somehow polluting the corpus. Admins can lower the level of a contributor so that all their subsequent contributions will be marked in red. The contributor will see for themselves that their contributions are red as soon as they are saved.
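
For the more technically minded, the rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how the feature is implemented on the site: the class and function names are hypothetical, and the sketch treats any level below 0 as "new contributions are unapproved", whereas the post only mentions the levels 0 and -1.

```python
from dataclasses import dataclass


@dataclass
class Contributor:
    name: str
    level: int = 0  # default level; an admin can set it to -1


@dataclass
class Sentence:
    text: str
    owner: Contributor
    approved: bool = True


def add_sentence(text, owner):
    """Create a sentence; it starts unapproved if the owner's level is below 0."""
    return Sentence(text=text, owner=owner, approved=owner.level >= 0)


def mark_unapproved(sentence):
    """What a corpus maintainer would do to a single sentence."""
    sentence.approved = False


def export_rows(sentences):
    """Rows for the distributed CSV file: unapproved sentences are left out."""
    return [(s.owner.name, s.text) for s in sentences if s.approved]


alice = Contributor("alice")
bob = Contributor("bob", level=-1)
corpus = [add_sentence("Hello.", alice), add_sentence("Copied text.", bob)]
print(export_rows(corpus))
```

Note that unapproved sentences stay in the list itself (the database, in the real site); only the export filters them out.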

This feature can obviously be tuned a lot more. Ideally we should treat bad quality sentences differently from non CC-BY sentences. Ideally we should set a different level for each user for each language. Ideally we should also have approved sentences, and we could have different levels of approved and unapproved sentences. We just don't have the time and resources to implement these things right now, but they are part of the next steps.

Thursday, May 8, 2014

Tatoeba & Google Summer of Code

Many of you probably don't know, but Tatoeba has been accepted as a mentoring organization for this year's Google Summer of Code. It means that we will have students coding for Tatoeba during the summer, and they will be paid by Google to do so. More specifically, we will have 4 students working on projects related to Tatoeba, from May 19th to August 18th.

There was a first meeting between the students and their mentors 10 days ago so that everybody could get acquainted. I am now organizing another meeting so that the students can get to know the community better. AlanF and I will be there, and anyone else who is interested in joining is welcome :)

Date and time of the meeting: Sunday 11th, 14:00 UTC.

To participate, join our IRC Channel:
  • server: freenode
  • channel: tatoeba

Saturday, April 26, 2014

Tatoeba update (April 26, 2014): speed improvements and 7 new languages

The Tatoeba website now runs much faster, thanks to a language list caching improvement by gillux and migration from Apache to nginx by lool0.

Email notifications have been restored.

Display of furigana for Japanese has been restored.

In addition, we have 7 new languages:
  • Cherokee
  • Crimean Tatar
  • Chinyanja
  • Kumyk
  • Livonian
  • Navajo
  • Udmurt

Sunday, April 20, 2014

Tatoeba update (April 20th, 2014)

New fonts

We tried using new embedded fonts (Oxygen and Amaranth) as the main default fonts for Tatoeba, but at the request of several users, the font was reverted to Trebuchet MS.
Oxygen and Amaranth are closer to the look and feel that I want for Tatoeba, but the limitation of these fonts to Latin letters creates a discrepancy that is indeed problematic, since Tatoeba is a linguistic platform and needs to handle a wide variety of characters uniformly.
I don't have a proper answer to this issue, so I've reverted to Trebuchet MS. The website, however, uses only 1 font overall. For people who enjoyed the previous fonts, this is my Stylish style for them (it is for Firefox; you'd have to modify it slightly for Chrome).

Upgrade of CakePHP

Tatoeba is based on the CakePHP framework. We upgraded it from version 1.2.6 to 1.2.12. An upgrade to version 1.3 is on the way.
This is not going to have any significant impact for users right now, but in the longer term, upgrading to more recent versions of CakePHP can result in better performance (in other words, a faster website overall).

  • Fixed a bug where admins could not change the status of a user if their interface was not in English.

New languages
  1. Abkhaz
  2. Tetun
  3. Tamil
Renamed languages
  • Teochew -> Min Nan Chinese
  • Wallon -> Walloon
  • Piemontese -> Piedmontese
  • Standard Tibetan -> Tibetan

Sunday, April 6, 2014

Tatoeba update (April 6, 2014): Chinese converter fixes and 21 new languages

What's new
  • The logo no longer says "beta", nor does our discussion of the list feature.
  • The pinyin converter now works again.
  • The Chinese traditional/simplified converter now works again.
  • Updated icons for Kurdish and Telugu.
  • The Spanish user interface is now fully translated.
  • When editing a comment using the German interface, the text for the "Abbruch" link no longer overlaps the "absenden" button.

New languages
  • Bashkir
  • Chuvash
  • Hausa
  • Hawaiian
  • Hill Mari
  • Kinyarwanda
  • Kyrgyz
  • Lakota
  • Luxembourgish
  • Macedonian
  • Mambae
  • Mon
  • Nogai
  • Ottoman Turkish
  • Pipil
  • Shona
  • Shuswap
  • Somali
  • Yakut
  • Yoruba
  • Zulu
These 21 new languages, added to the 146 we had previously, give us a total of 167.

Sunday, March 30, 2014

Tatoeba update (March 30, 2014): 15 new languages

We have just updated the website again. Tatoeba now has 15 new languages, for a total of 146. The new languages are:

- Amharic
- Awadhi
- Bhojpuri
- Chavacano
- Middle English
- Middle French
- Haitian Creole
- Juhuri (Judeo-Tat)
- Greenlandic
- Meadow Mari
- Nahuatl
- Pennsylvania German
- Sinhala
- Turkmen
- Wallon

Thank you to those who gave us the sentences and information to fulfill these requests. Note that the procedure for requesting a new language (which involves supplying at least five sentences in that language) can be found via the Tatoeba menu under "More"/"Tatoeba Wiki"/"How to Request a New Language", or at this link.

Sunday, March 23, 2014

Tatoeba update (March 23, 2014)

We are pleased to announce a set of updates to the site. In addition to the differences that you'll see when you visit the site, we have some major changes behind the scenes that make it easier for us to attract and work with developers around the world.

  • Contributors can now edit their comments on sentences or the Wall.

User Interface
  • Added link to friendlier search instructions.
  • Improved UI text (fixed misspellings, etc.) in English.
  • Incorporated updates to UI translations from the past year or longer, most notably in Japanese and German (which is now 100% translated!).
  • Internationalized several strings so that they can now be translated.
  • Changed remaining references to "" into references to "".
  • Renamed "Modern Greek" to "Greek".

  • Empty passwords are no longer accepted.

  • Now accepts profile photos with uppercase file extensions as well as lowercase.

  • Moved repository from Subversion on Assembla to Git on GitHub.
  • Added scripts for adding languages and incorporating updated translations.
  • Fixed various issues that appeared on developers' machines.

  • All sentences have been indexed, so they will appear in the search results.

Even more important than the changes to the code is the fact that the team behind it is stronger and more responsive than it has been in a long time. We are especially looking forward to working with our Google Summer of Code participants, once we know who they will be.

Whether you are interested in contributing sentences, translating the user interface, developing code, testing the site, or all of the above, we hope you will join the team!

Saturday, March 1, 2014

Why We Need You to Help Beyond Adding Sentences

al_ex_an_der wrote: "I'd find it helpful if you could explain if possible in plain English why a newly added sentence can be found by Google already one minute later but by Tatoeba only one month later." I thought this was worth some discussion in a thread of its own.

First of all, I did a little experiment to determine whether a Google search for a word contained in a sentence that I had added a minute earlier really would succeed. Answer: no, though in one case, it remarkably took only about fifteen minutes before a search ("incontrovertible") found it. But searches for words that I added in sentences seventeen hours and one hour ago came up empty.

To address Alexander's larger point: Why is it that Google indexes words so quickly, and Tatoeba takes so long? It comes down to differences in the hardware, software, human resources, and project management available to Google (a corporation with US$59 billion of revenue in 2013) and Tatoeba (a nonprofit whose budget is somewhat smaller). Google has vast "farms" of machines. Tatoeba has one. Even two machines would be a big improvement because one could index while the other was still actively handling requests and adding sentences. Getting from one to two, however, requires more funding, which demands organization, not just in terms of assembling a proposal for a grant or plans for fundraising, but for putting the money to use if and when it actually comes through. It also requires someone to write the code that can handle interaction between two computers operating in parallel. Software can accomplish what seems like magic, but it's not written by magic.

Tatoeba can never hope to replicate the money or machines that Google has at its disposal, but we can do a far better job (even beyond the impressive things we already do) if we get a lot more participation in everything that makes the site run, beyond the operations of adding, commenting on, modifying, and deleting sentences. Many of the people essential to Google are not software developers, and much of what we're missing at Tatoeba can be provided by people who are not developers, either.

In my last long post, I called for volunteers for testing, either at a high level (putting together a test plan and coordinating other volunteers), or simply working through some screens and determining whether they work. I also asked for someone to coordinate the translators who work on the code at Launchpad. Of course, I would have been glad if someone proposed to help in some way that I didn't even mention. But I was disappointed that no one responded at all. I want you to understand why people stepping up to help are not just nice to have, but essential.

We've undergone some changes in the way we store code, and we need to undergo some changes in the way we put it on the server. If we don't test before and after we make these changes, we could easily break something without knowing that we've broken it. But testing takes time. If I am responsible for doing every level of test planning and testing, as well as planning how to move the code without losing anything, it will take weeks longer to get to the point where we can move it. It will also become likely that something else will change in the interim, so we'll have to begin the cycle again without making any progress.

People who work at Google are motivated by some combination of enjoyment of the tasks involved in their jobs, satisfaction from accomplishing the assignments that they're not initially able to do, and financial incentives for doing their work. Their jobs require them to learn new skills and to do what has to be done, not just what they know they already enjoy doing.

Tatoeba can't provide financial incentives, but we can give you everything else, including the chance to move beyond what you already know you can do to tackle what has to be done (write up a test plan, collect bug reports and enhancement requests from the Wall, fix code written in PHP even if your favorite programming language is Python), and feel proud of what you've accomplished. You can also feel pleased that you're keeping Tatoeba going so that you can continue to add sentences to the corpus.

There is one more reason why we need a coordinated team to bridge the gaps: You don't want anyone to burn out because they're asked to do too much. We all have commitments, and are limited in how much time we can contribute. If someone senses that they don't have enough time to do a job right, they'll drop out entirely. Let's make sure that we take full advantage of the incredibly talented people who've gotten us this far, and those who have yet to join us, by making sure that all the pieces fit together.

Please send me a note telling me how you'd like to help. Many thanks!

Sunday, February 23, 2014

Update on development

I just wanted to give an update on development at Tatoeba. There has been a lot going on behind the scenes.

To begin with, we're rebuilding the team. Developers who were involved in the past but had to take a break are now part of the crew again. Others are learning new skills so that they can help perform new tasks and make life easier for others. Thanks to lool0, we have a new mailing list so that people can communicate via e-mail and continue to refer to our collective wisdom throughout eternity.

In terms of changes that make life better for developers: First, pep has created a virtual machine, which means that regardless of whether developers are running on Windows, Mac, or Linux, they can recreate Tatoeba on their own machines, and can test both how it works now and how their changes affect it. This is a big deal. Secondly, lool0 has moved our repository (the place where we store our code and our problem reports) to GitHub, which is easier for our developers in all countries to reach, and opens us up to collaboration with people who find out about us there. Finally, I've been working with the various translations of the user interface so that the good work already done by the UI translation teams at Launchpad, as well as their translations for all our new features, can be seen live. We are also getting closer to getting new languages and audio onto the site once again. And perhaps it will even become possible to use Tatoeba on a smartphone.

But developers shouldn't have all the fun! I'm hoping that some of you will help with various tasks that don't require you to know how to write code. For instance, I would love for someone to take on the role of translation czar (monarch?), who communicates with the various Launchpad translation teams and sees how we're coming along with making the Tatoeba user interface available in dozens of languages. It would also be great if we could get people involved in testing, whether at the top level (creating a test plan, recruiting and coordinating with other testers), or simply stepping through a list of functions and seeing how well they work. Finally, it would be nice to have a king/queen of collecting audio so that Tatoeba can be heard as well as seen.

Please send me a private message at Tatoeba (or e-mail me at alanf . tatoeba AT gmail) if you're interested in getting involved. And if you want to be part of the development team, visit (and join) the mailing list at!forum/tatoebaproject . I look forward to hearing from you!

Sunday, January 12, 2014

Hello, team!

Hello, everyone! Trang asked me to write this post to let you know that I'm going to be coordinating between the administrators, the developers and the contributors who form the Tatoeba community. As she explained in her previous post, she and sysko are very busy now with other responsibilities. Thus, they can't be involved in all the same ways, and to the same extent, as they were in the past. However, she will maintain an active advisory role in which she promptly answers the questions that I pass to her from the developers. In turn, I will make sure to relay those answers quickly back to the developers. In addition, as I've been doing for months now, I will make sure that the problems and requests posted by contributors on the Wall make their way into help tickets so that they can be tracked and solved in an organized way.

I will also be working with the developers on ways to make the site robust. We will be documenting and spreading the collected knowledge that prevents problems and that helps us recover from them quickly when they do happen. At the same time, we will be planning how to restart development.

I will soon be contacting people who have contributed to development, or have helped us get back on our feet after problems have occurred, or both. But just as importantly, I urge you to contact me if you can help with the technology on the site, whether or not you have done so in the past. You can always send me a private message via Tatoeba to tell me you would like to get involved, along with ways that you can be contacted and a description of your skills, interests, and experience. (There will be other ways you can contact me in the future.) I'll let you know how the current software on the site is set up and how you can jump in.

I'm excited about our getting together to make this site, which I love so much, stronger and better in every way. Join us in making 2014 a happy new year for Tatoeba!

Saturday, January 11, 2014

We need a better team!

Alright, so we have a problem.

Last month Tatoeba crashed and was then unavailable for more than a week, which is a pretty long time for a website that is used every day by thousands of people. After we managed to get Tatoeba back up, there were still issues with the site being very slow, and there was some additional downtime. Only now, after 3-4 weeks, are things stable again (at least they seem to be). It's a good thing that we got everything to work again, but it's not a good thing that it took so long. Now that I don't have to worry about Tatoeba not responding, or being too slow, I'd like to take some time to talk about the current situation.

I don't want to sound dramatic, but the current situation isn't good. There was a time when Tatoeba was actually growing, as opposed to the past 2 years, when things have been stagnant. There was a time when users would report bugs, and they would be fixed within a few days, sometimes within a few hours. There was a time when users would request new features, and they would be implemented and released the next week. And if Tatoeba crashed, we could be working on it before you would even notice it was down. Basically, that was the time when sysko and I were both very involved in the project, and had the necessary time, motivation, energy, and passion to work on it.
But things have changed, and now it feels like the project is going to fall apart if we don't do anything. Not right now, but one day. I mean, it's still working, a lot of people still love it, but nobody can maintain it properly anymore, and nobody can make it grow anymore. There are so many things that we could do, that we should do, but neither sysko nor I can (or want to) do them, and they will never be done unless someone other than us is willing to take care of them.

So my priority right now is to make a better team for Tatoeba. We need more people to take on tasks/responsibilities that would usually be done by sysko or myself. I'm talking about things like accessing the server and updating Tatoeba's code to include bug fixes or new features, writing on the blog and on Twitter to keep users informed of what's going on or what we're working on, replying to emails that are sent to, etc.
As far as I'm concerned, I know that I will never be able to dedicate as much time and energy to Tatoeba as I used to, and neither will sysko. But I still want this project to keep growing and be more successful, and I know it's not going to happen if there's only sysko and myself in charge of these "higher responsibility" kinds of tasks.
Of course we wouldn't give such responsibilities to complete strangers, and there are some people who are on my "need to talk to" list. But whoever you are, if you're reading this and feel that you would want to participate in this project on a higher level, then contact us and let us know about it!

Now, before I end this article, there is another topic that I'd like to mention briefly: donations. It has been brought up a few times on the Wall and also in our IRC channel that we should start a donation campaign, or try to raise money through Kickstarter or by selling goodies. I will talk more about this in another article, but my short answer is yes, I agree. And this is probably going to be one of the next priorities. But first, we need a better team, and hopefully 2014 will be a better year for Tatoeba.

Happy New Year by the way :)