Full text search support has come a long way since the early days of Ferret. I’ve been using Ultrasphinx for a few years, and while it runs great, it doesn’t work out of the box with Rails 3. Two search projects that seem to be garnering a lot of community support are Thinking Sphinx and Sunspot.

Thinking Sphinx is the most logical successor to Ultrasphinx, since both use Sphinx as the search server. Sphinx builds its search index by reading data directly out of the database, and communication with the Sphinx server happens by sharing C “objects” over sockets. Sunspot uses Solr, a Java search server built on the Lucene search library, and communicates with it through its REST API, using XML. Although the search engine is written in Java, Sunspot bundles a version of Solr that runs as a standalone server, making deployment just as easy as Thinking Sphinx.

Solr is a compelling alternative to Sphinx: some of the most scalable web apps (Facebook, Twitter) use Java behind the UI layer, Solr servers can be clustered, and since Solr manages the index itself, Sunspot can automatically update the index when model objects change. There’s no need to run a cron job to reindex the data or to set up delta indexing, as with Thinking Sphinx.

What worries me, however, is the performance impact of the XML serialization and deserialization required to communicate with Solr. Processing XML documents is not as fast as unpacking C objects. To measure the relative impact, I created a little benchmark. The Readme describes the test in more detail and provides the source and instructions so you can configure it to your needs.

To make the scenario slightly more realistic, the benchmark was run inside my Ubuntu VM while communicating with the search process running on my OS X host. The host box has four cores clocked at 2.66 GHz each; the Ubuntu VM had one core dedicated to it. There was plenty of RAM available for both the search engine and the benchmark task processing the search results. Mostly I did this to ensure that Thinking Sphinx wasn’t cheating by using Unix sockets for communication. I ran 50,000 searches and printed a timing after every 5,000. These are the results:
Runs     Thinking Sphinx   Sunspot
 5000         38.49        1611.60
10000         38.54        1648.51
15000         39.06        1614.52
20000         38.86        1583.53
25000         39.78        1613.79
30000         38.83        1595.60
35000         38.34        1571.96
40000         38.06        1631.87
45000         37.57        1603.31
50000         38.23        1634.53
Total        385.80       16109.26

(All times in seconds.)

I had expected Thinking Sphinx to be faster, but not over 40 times faster. Extrapolating the numbers out, one can run more than 200,000 searches against Sphinx in the time it takes Solr to run 5,000. This was just a rough test of the relative difference, and it is purely based on read performance of a few hundred records. It’s possible that proper tuning could improve Solr’s performance, or that frequent re-indexing could degrade Thinking Sphinx’s, but it’s hard to see that chasm closing enough for the two to be comparable when the search index can fit on one machine.
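The relative cost of the two wire formats is easy to sanity-check in isolation. The toy comparison below (not part of the benchmark above; the XML shape and the packed binary layout are made up purely for illustration) parses a small Solr-style XML response with REXML versus unpacking the same document ids from a packed binary string:

```ruby
require 'benchmark'
require 'rexml/document'

# Hypothetical payloads: both carry the same three document ids.
xml    = '<response><docs><doc id="1"/><doc id="2"/><doc id="3"/></docs></response>'
binary = [1, 2, 3].pack('N3') # three 32-bit big-endian integers

N = 10_000

xml_time = Benchmark.realtime do
  N.times do
    doc = REXML::Document.new(xml)
    REXML::XPath.match(doc, '//doc').map { |e| e.attributes['id'].to_i }
  end
end

bin_time = Benchmark.realtime do
  N.times { binary.unpack('N3') }
end

puts format('XML parse:     %.4fs', xml_time)
puts format('Binary unpack: %.4fs', bin_time)
```

Per message, the XML path is dramatically slower than the single `unpack` call, which is consistent with (though far from a complete explanation of) the gap in the table above — the real clients also differ in connection handling, query parsing, and result hydration.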