Ruby Full Text Search Performance: Thinking Sphinx vs Sunspot Solr

Posted by Tejus Parikh on November 5, 2010

Full text search support has come a long way since the early days of Ferret. I’ve been using Ultrasphinx for a few years, and while it runs great, it doesn’t work out of the box with Rails 3. Two search projects that seem to be garnering a lot of support from the community are Thinking Sphinx and Sunspot.

Thinking Sphinx is the most logical successor to Ultrasphinx, since both use Sphinx as the search server. Sphinx builds its search index by reading data directly out of the database. Communication with the Sphinx server happens by passing packed binary data (C “objects”) over sockets.

Sunspot uses Solr, a Java search server built on the Lucene search library. Sunspot communicates with Solr through its REST API, using XML. Although the search engine is written in Java, Sunspot bundles a version of Solr that runs as a standalone server, which makes deployment just as easy as with Thinking Sphinx. Solr is a compelling alternative to Sphinx, since some of the most scalable Web apps (Facebook, Twitter) use Java behind the UI layer. Solr servers can be clustered, and because Solr manages the index itself, Sunspot can automatically update the index whenever the model objects change. There’s no need to run a cron job to reindex the data or to set up delta indexing as there is with Thinking Sphinx.

However, the performance impact of the XML serialization and deserialization required to communicate with Solr worries me. Processing XML documents is not as fast as unpacking C objects. To test this difference, I created a little benchmark to measure the relative impact. The Readme describes the test in more detail and provides the source and instructions so you can configure it to your needs.

To give the test a slightly more realistic scenario, the benchmark ran inside my Ubuntu VM while communicating with the search process running on my OS X host. Mostly I did this to ensure that Thinking Sphinx wasn’t cheating by using Unix sockets for communication. The host box has four cores clocked at 2.66 GHz each, and the Ubuntu VM had one core dedicated to it. There was plenty of RAM available for both the search engine and the benchmark task processing the search results. I ran 50,000 searches and printed a timing after every 5,000 searches. These are the results:

Runs    Thinking Sphinx       Sunspot
5000              38.49       1611.60
10000             38.54       1648.51
15000             39.06       1614.52
20000             38.86       1583.53
25000             39.78       1613.79
30000             38.83       1595.60
35000             38.34       1571.96
40000             38.06       1631.87
45000             37.57       1603.31
50000             38.23       1634.53
Total            385.80      16109.26
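
For readers who want to reproduce this kind of per-batch timing, here is a minimal sketch of a loop that could produce output in the same shape. It is not the code from the benchmark’s Readme: `time_batches` and `fake_search` are hypothetical names, and the lambda only stands in for a real Thinking Sphinx or Sunspot query.

```ruby
# Hypothetical helper: runs `total` searches and records how long each
# batch of `batch` searches takes. `search` is any callable that performs
# one query; here it is a stand-in, not a real search client call.
def time_batches(search, total: 50_000, batch: 5_000)
  timings = []
  (total / batch).times do |i|
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    batch.times { search.call }
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    timings << elapsed
    # One row per batch: cumulative searches so far, then the batch time.
    puts format("%-8d %14.2f", (i + 1) * batch, elapsed)
  end
  puts format("%-8s %14.2f", "Total", timings.sum)
  timings
end

# Assumption: a cheap placeholder instead of a real search call.
fake_search = -> { "ruby full text search".split.map(&:upcase) }

# Small numbers so the sketch finishes instantly; the post used 50,000 and 5,000.
timings = time_batches(fake_search, total: 20, batch: 5)
```

Swapping the lambda for a real query against each search server, and running it from a different machine than the one hosting the server, would mirror the setup described above.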

I had expected Thinking Sphinx to be faster, but not roughly 42 times faster. Extrapolating the numbers, Thinking Sphinx can run about 200,000 searches in the time it takes Sunspot and Solr to run 5,000. This was just a rough test to see the relative difference, and it is based purely on read performance against a few hundred records. It’s possible that proper tuning could improve Solr’s performance, or that frequent re-indexing could degrade Thinking Sphinx’s, but it’s hard to see that chasm closing enough for the two to be comparable when the search index fits on one machine.
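
The ratio can be checked directly from the totals in the results table. The short calculation below is only that arithmetic, with variable names of my own choosing:

```ruby
# Totals copied from the results table.
sphinx_total  = 385.80
sunspot_total = 16_109.26

# How many times longer the Sunspot run took overall.
speedup = sunspot_total / sphinx_total

# Searches Thinking Sphinx could finish in the time Sunspot needs for 5,000.
equivalent_searches = (5_000 * speedup).round

puts format("Speedup: %.1fx", speedup)
puts "Equivalent searches: #{equivalent_searches}"
```

This works out to a factor of a little under 42, or a bit more than 200,000 searches for every 5,000 Sunspot searches.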

