One of my projects at work lately has been a searchable index of about 80,000 images, each involving about 20 fields' worth of metadata. It's a Drupal project, so it was pretty easy to set up the appropriate content types, fields, and so forth, but when it came time to set up searching, I made a few regrettable assumptions that cost me a lot of time.

Given the record count, I decided it didn't make much sense to use Drupal's core search functionality; I was under the impression that the core search just grepped through the contents of the node table, and would therefore not perform particularly well. That's regrettable assumption #1. Regrettable assumption #2 is simpler: I didn't think search would ever perform well as long as the index was stored in the database.

As a result, I went on an odyssey of sorts looking for replacement search engines. Some of the contenders:

Apache Solr from Acquia
Apache Solr is a Java-based search indexing platform with a supporting Drupal module, and as it happens, the Drupal support company Acquia provides a hosted Solr service that its subscribers can use. We do have an Acquia subscription, but unfortunately we also have hundreds of Drupal sites, and the subscription doesn't quite cover that many. So I moved on.
Self-hosted Apache Solr
We've occasionally considered setting up our own Solr instance as a service for our users around campus, but the administrative overhead doesn't really fit our schedules just yet. So again, I moved on.
Search Lucene API
Unlike the two Solr-based options, the Search Lucene API module handles its search indexing via PHP (specifically, via Zend_Search_Lucene). It also has a pretty good selection of helper modules available for things like faceted search, content suggestion, and so forth.

Of the three options, Search Lucene API seemed like the best choice with the least administrative overhead. Over the next couple of weeks I hacked away amid intermittent user support requests, slowly but surely piecing together the necessary components for a killer faceted search system. Once I was ready to try it, I started to import the content. Node by node it arrived, and the search kept on scaling successfully as it went. Pleased as punch, I went home for the evening so that the rest of the records could import.

The next morning my inbox was stuffed to the brim with out-of-memory errors from Drupal cron runs. I checked the search index settings; the system had managed to pull in around 33,000 records, but indexing had ground to a halt. It was so bad that I couldn't even access the index statistics page to tell it to rebuild. And this on a system with 112MB dedicated to PHP.

I was confused. I'd never experienced scaling problems with Zend Framework components before, and I couldn't imagine that Drupal added that much overhead. Not wanting to admit defeat, I posted an issue. Soon, the maintainer politely informed me that Search Lucene API was only intended to scale up to about 10,000 records, and fewer than that if they were particularly complicated.

It would seem I was hosed. However, I realized that there was one more contender I hadn't quite considered yet:

Drupal core search
Drupal comes with a built-in search module, and it's supported by any number of contributed helper modules providing the functionality it doesn't have on its own (e.g., faceted search).

Despairing of all other hopes, I turned off Search Lucene API and turned on the core search module with the appropriate helpers... and it handled everything without a hiccup!

As it turns out, Drupal's core search is a lot smarter than I'd given it credit for. Yes, it searches against the database, but not against the node table: it has a dedicated search index table that is built up on cron runs, just as the other search modules build their own indexes. With that in mind, it's no surprise that it's a lot faster than I had expected. It also doesn't introduce nearly the same PHP memory overhead as Search Lucene API, because a lot of the heavy lifting is offloaded to the database server (which, in our case, is more than powerful enough).
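
For anyone who shares my original misconception, here is a rough sketch of the general idea: a word index table filled in batches (the way a cron run might do it) and then queried instead of scanning the content itself. This is not Drupal's actual code or schema. It is Python with SQLite purely for illustration, and the table names (`record`, `word_index`), the functions, and the word-count scoring are all made up for the example.

```python
import re
import sqlite3


def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(conn, batch_size=100):
    """Index up to batch_size not-yet-indexed records, as one cron run might.

    Returns the number of records indexed in this run.
    """
    rows = conn.execute(
        "SELECT id, body FROM record WHERE indexed = 0 ORDER BY id LIMIT ?",
        (batch_size,),
    ).fetchall()
    for record_id, body in rows:
        # Hypothetical scoring: how many times each word occurs in the record.
        counts = {}
        for word in tokenize(body):
            counts[word] = counts.get(word, 0) + 1
        conn.executemany(
            "INSERT INTO word_index (word, record_id, score) VALUES (?, ?, ?)",
            [(word, record_id, count) for word, count in counts.items()],
        )
        conn.execute("UPDATE record SET indexed = 1 WHERE id = ?", (record_id,))
    conn.commit()
    return len(rows)


def search(conn, query):
    """Return ids of records containing every query word, best score first."""
    words = sorted(set(tokenize(query)))
    if not words:
        return []
    placeholders = ",".join("?" * len(words))
    sql = f"""
        SELECT record_id, SUM(score) AS total
        FROM word_index
        WHERE word IN ({placeholders})
        GROUP BY record_id
        HAVING COUNT(DISTINCT word) = ?
        ORDER BY total DESC, record_id
    """
    # The database does the matching and ranking; the application only
    # receives the ids of the matching records.
    return [row[0] for row in conn.execute(sql, (*words, len(words)))]


def make_demo_db():
    """Create an in-memory database with three sample records."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE record (
            id INTEGER PRIMARY KEY, body TEXT, indexed INTEGER DEFAULT 0
        );
        CREATE TABLE word_index (word TEXT, record_id INTEGER, score INTEGER);
        CREATE INDEX word_index_word ON word_index (word);
    """)
    conn.executemany(
        "INSERT INTO record (id, body) VALUES (?, ?)",
        [
            (1, "Red barn at sunset, red sky"),
            (2, "Blue barn in winter"),
            (3, "Sunset over the lake"),
        ],
    )
    conn.commit()
    return conn
```

Calling `build_index(conn, batch_size=2)` on the demo database indexes only the first two records, so `search(conn, "sunset")` finds record 1 alone until a later run picks up record 3. The point of the sketch is that a query only ever touches the indexed word table, and the matching happens inside the database rather than in application memory.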

The moral of the story? Know thy bottlenecks. If I had realized how well Drupal's core search performed, I never would have tried to optimize it out of the equation, and I would have saved myself a significant amount of development time. Good to know; lesson learned; hope this helps someone else.
