Sunday, April 18, 2010

Switching from Lucene to Sphinx

As if migrating to a new server wasn't enough, we also decided to migrate to a new search engine. It was a rather on-the-fly decision, but I must admit, it was fun :D

A little bit of context

Until now we were using a search engine called Lucene. It's written in Java, and the integration of Lucene into Tatoeba is something that was coded three years ago, back when I didn't know how to code and wasn't even sure yet I would pursue a career in computer science.
I was just very lucky that a computer science student at my university found out about my project and was interested in joining me in the task of integrating a search engine, as part of a school project (thank you François, if you're reading this).

The problem is, running Lucene takes a lot of memory. And our new server doesn't have a lot of memory (512MB RAM). So we figured, okay, we'll just leave the search engine on the old server (2GB RAM), Masa (the admin) will not mind.

But Masa had been wanting to clean up his server and reinstall it from scratch, and couldn't. He didn't want Tatoeba to be in trouble (because that would have meant we had to find somewhere else to go, even if only temporarily). So when I told him we were moving to our own server, he was quite excited: he could finally reinstall in peace. I told him our migration was scheduled for Saturday, April 17th, and that we would find a temporary solution for the search engine, so he could do whatever he wanted on Sunday.

Migration day

Saturday, migration day. Lots of things to do. And I couldn't be in Paris with 3 other members of my team (Allan, Robin and Baptiste), so it only made the task harder. I won't go into details, but we reached the end of the day, everything went pretty well, except we hadn't taken care of the search engine yet...

We were on IRC, and Robin and Baptiste had left. I was telling Allan all the hackish stuff we would need to do to set up the search engine, because the initial plan was to temporarily use his machine at work to host it. But then he said "Okay, this is too hackish, I'll try to find another solution, otherwise we will never update the search engine".

Except I had received an email from Masa earlier, telling me he would really like it if we could be done migrating by 1AM, so I told Allan "But Masa really really wants to reinstall his server, we need to have something working by midnight". And it was 8PM...

How we decided to use Sphinx

Allan was not going to give up so easily. He started telling me that he had already done some research before, and that Sphinx was often mentioned as a competitor of Lucene.
me: Sphinx or Lucene, if you can code me something within 2-3 hours, I have nothing against it.
So he kept going, telling me that Sphinx handles stemming, that it's written in C++, that someone made a behavior to integrate it in CakePHP...
me: Alright, but it will be for next week :P
Allan: So I didn't really have a choice...
me: Ah because you want to do this now?
Allan, quoting me: Sphinx or Lucene, if you can code me something within 2-3 hours, I have nothing against it.
me: Well okay, we can try it.
Allan: Yea because you know, there wasn't any big fail in our migration, so we need to add more pressure, otherwise it's not fun.
me, thinking: Like I didn't have enough pressure for the day *sigh*. (Allan was in the train while *I* was doing the migration)
me: Give me the links you have, I'll see what I can do to speed up the integration.
It was 8:30PM.

How things went

Things went very well :) Note that none of us knew much about Sphinx before. We had no idea how difficult (or how easy) it was to install it, and run it, and integrate it in CakePHP. Allan took care of the installation & configuration part while I was taking care of the integration in CakePHP.

I still had to know how to install it locally though. As a Windows user, I must say this link helped me a lot:

Once I understood how Sphinx worked and how to get it to work (which took me a bit more than one hour), all I had to do was follow the explanations in the Sphinx Behavior documentation, adapt the code to Tatoeba, figure out how to pass GET variables with CakePHP's Paginator, and add a "warning" message to let users know that we're switching to a new search engine and that some features are no longer available (but of course we will integrate them back as soon as possible).

In the meantime, Allan installed Sphinx on our new server, figured out how to create one index for each language so that people can still search within a specific language, figured out how to query the right index from CakePHP, and figured out how to make the search work for languages that have non-ASCII characters.
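
To give an idea of what "one index for each language" can look like, here is a minimal sphinx.conf sketch for a single language. It is not the actual Tatoeba configuration: the database credentials, the table and column names (sentences, id, text, lang), the index name and the path are all hypothetical, and the directives are the ones from the 0.9.x versions of Sphinx that were current at the time.

```
# Hypothetical source: all English sentences from a MySQL table.
source src_eng
{
    type          = mysql
    sql_host      = localhost
    sql_user      = sphinx_user        # placeholder credentials
    sql_pass      = secret
    sql_db        = tatoeba            # assumed database name

    # Ask MySQL to send UTF-8 so non-ASCII text is not mangled.
    sql_query_pre = SET NAMES utf8

    # The first column must be a unique integer document ID.
    sql_query     = SELECT id, text FROM sentences WHERE lang = 'eng'
}

# One index per language; repeat the source/index pair for each one.
index eng_index
{
    source       = src_eng
    path         = /var/lib/sphinx/eng_index   # assumed location
    charset_type = utf-8     # needed in 0.9.x; newer releases may not accept it
    morphology   = stem_en   # English stemming; only makes sense for English
}
```

With a layout like this, searching in a given language comes down to sending the query to that language's index, and the indexer can rebuild each index on its own.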

It was then 1AM, and we had done it. We had installed Sphinx, integrated it into CakePHP, gotten it to work for all the languages we support, run tests to make sure basic searches were working, and updated Tatoeba.

Now everything is soooo fast, it's awesome. Besides, indexing with Sphinx only takes 30-60 seconds (compared to 15-20 minutes with our 3-year-old Lucene code). So we can afford to index much more often.

The whole experience was awesome as well. The challenge, the teamwork, the achievement. I loved it :D

1 comment:

  1. Hello there! You said everything went well with the migration from Lucene to Sphinx. I'm just wondering if you could help me out with a little problem. If I have this code in my controller:

    $sphinx = array('matchMode' => SPH_MATCH_ALL, 'sortMode' => array(SPH_SORT_EXTENDED => '@relevance DESC'));
    $results = $this->Post->find('all', array('search' => 'it', 'sphinx' => $sphinx));

    How do I pass/display them onto my View (index.ctp) code? Can you give me an example please? Thank you so much.


