Category Archives: Search

Exploring LucidWorks Enterprise with SolrCloud

In this blog post I will show how to get a distributed search cluster up and running using LucidWorks Enterprise 2.1 (LWE) and SolrCloud. LucidWorks Enterprise 2.1 is the most recent release as of this post.

Definitions

I want to start with a few definitions of the terms I will be using in this article. The documentation for SolrCloud is confusing and seems to have multiple definitions for the same term. For example, the SolrCloud wiki page defines a shard as a partition of the index and the NewSolrCloudDesign wiki page seems to refer to it as a replica.

For the purpose of this article we will use the following definitions:

  • collection: a search index composed of the total set of documents being searched.
  • shard: a partition of a collection. A shard contains a subset of the documents being searched, a collection is composed of one or more shards.
  • replica: a copy of a collection. If a collection is composed of N shards, a single replica means each of those shards will have one copy.
  • node: a single instance of LucidWorks Enterprise or Solr. A single node can contain multiple collections where each collection has a different data source.

Requirements

The basic requirements for this test setup are:

  • Start with a single node
  • Create an index with two shards and one replica
  • Index some documents that should be split between the two shards
  • Bring a new node online and move one of the shards and replicas to this new node

Step 1: Installation

Download and install LucidWorks Enterprise. For this first node, use all the default settings provided by the installer. Installation instructions can be found on Lucid Imagination’s official documentation.

Step 2: Verify Installation and Stop LWE

If you used all the default settings and options during installation of LucidWorks Enterprise, you should have access to the following:

After verifying you have a working installation, stop LWE. To do this, browser to your installation directory and run:

$LWE/app/bin/stop.sh

Step 3: Bootstrap Zookeeper

Running LucidWorks Enterprise and Solr in a distributed mode requires the use of Apache Zookeeper. Lucid Imagination’s documentation recommends running a separate Zookeeper ensemble for production deployments. That is outside the scope of this article, so we will use Solr’s embedded version of Zookeeper that is intended for development purposes only.

Since this is the first time we are running Zookeeper, we need to “bootstrap” it with LWE’s configuration files. To do this start LWE with the bootstrap flags:

$LWE/app/bin/start.sh -lwe_core_java_opts "-Dbootstrap_conf=true -DzkRun"

This bootstrap process only needs to be done once. In the future, you can start LWE with the zkRun flag:

$LWE/app/bin/start.sh -lwe_core_java_opts "-DzkRun"

Once bootstrapped, you can head over to the Solr cloud admin page (http://localhost:8888/solr/#/cloud) and see if the default LWE configs were uploaded to Zookeeper. Verify that you have configs called collection1 and LucidWorksLogs.

LWE configs in Zookeeper

Step 4: Create A Test Collection

To keep things simple, we are going to use LucidWorks Enterprise’s “collection1″ configuration. This is the out of the box schema and solrconfig settings that ship with LucidWorks Enterprise. In most situations you will need to create a schema specifically for the content you are indexing but this default configuration is fine for our test collection.

According to Lucid Imagination’s documentation, it is not yet possible to create a collection containing multiple shards via their admin interface or REST api. Due to this limitation, we will need to do things manually by using Solr’s Core Admin API.

Update 05/05/12: Mark Miller pointed out that this is a lapse in the LucidWorks Enterprise documentation. You can specify the numShards parameter via the LucidWorks REST api or if using the UI, it will honor the numShards system property. This is nice but does not simplify the steps of this post, see comments below.

I wish I could say this is as easy as executing an api call specifying that we would like to create a new collection with two shards and one replica, but I can’t. In it’s current form everything related to creating shards and replicas needs to be done manually.

The SolrCloud documentation mentions the use of the numShards parameter which I assumed would be used to automatically split new collections. In my testing this was not the case, all it does it create a new Zookeeper entry for a second shard but you still need to manually create a Solr core for that shard using the Core Admin API.

So, now that we know we need to do everything manually, execute the four core admin api calls to create a single collection. The four api calls are:

Create the first primary shard:

curl 'http://localhost:8888/solr/admin/cores?action=CREATE&name=testcollection_shard1_replica1&collection=testcollection&shard=shard1&collection.configName=collection1'

Create a replica of the first shard:

curl 'http://localhost:8888/solr/admin/cores?action=CREATE&name=testcollection_shard1_replica2&collection=testcollection&shard=shard1&collection.configName=collection1'

Create the second primary shard:

curl 'http://localhost:8888/solr/admin/cores?action=CREATE&name=testcollection_shard2_replica1&collection=testcollection&shard=shard2&collection.configName=collection1'

Create a replica of the second shard:

curl 'http://localhost:8888/solr/admin/cores?action=CREATE&name=testcollection_shard2_replica2&collection=testcollection&shard=shard2&collection.configName=collection1'

Alright, now that we have the collection created time to check that everything was successful. Head back to the Solr Cloud interface (http://localhost:8888/solr/#/cloud) and view the clusterstate.json entry.

LWE Zookeeper cluster state

In this json output, you should see the new “testcollection” collection and that it is composed of two shards, “shard1″ and “shard2″. Expanding those shards will show our replicas, “replica1″ and “replica2″ for each shard.

You may be wondering why we are looking at the clusterstate.json file over the nice LWE Admin interface. Well, that is because when you create collections manually via Solr’s Core Admin API they do not show up in LWE. This is bug that I hope is addressed in a future version of LucidWorks Enterprise.

Update 05/05/12: If you start LWE with the numShards parameter and use the GUI/Rest API to create the initial collection, it will show up in the UI.

Uhh Ohh where is my new collection?  Not here.

Step 5: Index Data

I had intended on using the example crawler that ships with LucidWorks Enterprise, but Lucid Imagination states that data sources do not work when running LWE in SolrCloud mode. That and the fact I can’t see my collection in the LWE admin interface in order to assign a data source to my test collection.

So, for quick testing purposes I will resort to using the sample documents that ship with a standard download of Apache Solr. Once you download Solr, browse to the $SOLR_HOME/example/exampledocs directory.

Edit the file post.sh to point to our test collection update handler.

URL=http://localhost:8888/solr/testcollection/update

Save the file and run:

./post.sh *.xml

Now that we have data fed, lets check that it was distributed between the two shards we created and that our replicas contain the same data. Head back to to Solr admin page at http://localhost:8888/solr/#/.

  • click on the first shard “testcollection_shard1_replica1″, you should have 10 documents in this shard
  • click on the second shard “testcollection_shard2_replica1″, you should have 11 documents in this shard
  • check the replicas for each shard, they should have the same counts

At this point, we can start issues some queries against the collection:

Get all documents in the collection:
http://localhost:8888/solr/testcollection/select?q=*:*

Get all documents in the collection belonging to shard1:
http://localhost:8888/solr/testcollection/select?q=*:*&shards=shard1

Get all documents in the collection belonging to shard2:
http://localhost:8888/solr/testcollection/select?q=*:*&shards=shard2

Step 6: Add A New Node

Now time for the fun part, adding a new node. We want to create a new node and have the shards and replicas be split between the two nodes. This is going to be yet another manual process because SolrCloud and LucidWorks Enterprise do not automatically rebalance the shards as new nodes come and go.

To keep things simple, we going to run multiple instances of LWE on the same machine. So, run the LWE installer again but this time do not use the defaults. Select a new installation directory (I will refer to this as $LWE2 below), use port 7777 for Solr, and 7878 for LWE UI. Uncheck the box that starts LWE automatically.

Now we need to start our new instance of LucidWorks Enterprise and connect to our existing Zookeeper instance. To do this you need to set the zkHost parameter to the host and port of your existing Zookeeper instance. Unfortunately, Lucid’s documentation does not specify what port Zookeeper is running on. However, on the SolrCloud wiki page, I found that Zookeeper starts on Solr Port + 1000. In our case Zookeeper should be running on port 9888. Run the following command to start the new instance of LWE:

$LWE2/app/bin/start.sh -lwe_core_java_opts "-DzkHost=localhost:9888"

Now execute the two Solr Core Admin API calls to create our shard and replica on this new node since they are not automatically migrated from the first server.

Create a new replica of shard1

curl 'http://localhost:7777/solr/admin/cores?action=CREATE&name=testcollection_shard1_replica3&collection=testcollection&shard=shard1'

Create a new replica of shard2

curl 'http://localhost:7777/solr/admin/cores?action=CREATE&name=testcollection_shard2_replica3&collection=testcollection&shard=shard2'

At this point, take a look at the cluster state like we did at the end of step 4 above. You should still see our two shards, but each shard should now have three replicas. Two on the first node and one of the new node.

Also take a look at the new node’s admin interface at http://localhost:7777/solr/#/. If you look at the core status for our new shards you should see that our documents were automatically sent over from the first node. Finally something I did not need to do manually!!!

Issuing the same queries from step 5 above against the new node should yeild the same results.

http://localhost:7777/solr/testcollection/select?q=*:*
http://localhost:7777/solr/testcollection/select?q=*:*&shards=shard1
http://localhost:7777/solr/testcollection/select?q=*:*&shards=shard2

Step 7: Delete A Replica

Now that we have a new node in the cluster we can kill the extra shard replicas we had created on the first node. Issue the following Solr Core Admin API commands:

Unload and delete shard1 replica

curl 'http://localhost:8888/solr/admin/cores?action=UNLOAD&core=testcollection_shard1_replica2&deleteIndex=true'

Unload and delete shard2 replica

curl 'http://localhost:8888/solr/admin/cores?action=UNLOAD&core=testcollection_shard2_replica1&deleteIndex=true'

Take a look at the cluster state again and observe that we have finally achieved our desired outcome, a single collection with two shards and a replica.

A sharded collection in LWE

Conclusion:

As you can see it is possible to get LucidWorks Enterprise up and running with SolrCloud but it is not a trivial process. Hopefully future versions of LWE will make this process easier and address some of the bugs I mentioned above. At his point SolrCloud feels half-baked and it’s integration into LucidWorks Enterprise even less. Considering all the LWE features that do not work when running in SolrCloud mode, you would probably be better off running a nightly version of Solr 4.0 which will have the latest SolrCloud patches.

ElasticSearch Mock Solr Plugin

I just released an ElasticSearch plugin that Mocks the Solr interface. With this you can use tools and clients that are meant to talk to Solr with ElasticSearch. Some examples are Nutch, Apache ManifoldCF, SolrJ apps, etc. Currently, indexing and deleting of documents is supported 100% for XML (/update request handler) and JavaBin (/update/javabin request handler). Basic support for the Solr search handler (/select) is also included for the Solr q, start, rows, and fl parameters. The q parameter supports 100% of the lucene query syntax. Both XML and JavaBin response formats are supported.

To use the plugin:

  1. 1. Install
  2. $ES_HOME/bin/plugin install mattweber/elasticsearch-mocksolrplugin/1.0.0
  3. 2. Update your client code to point at ElasticSearch and the /_solr REST endpoint.
    http://localhost:9200/index/type/_solr

    Specifying the index and type is optional and will default to “solr” for index, and “docs” for type.

  4. 3. Use your Solr client as normal.

I have tested the plugin with Nutch and various SolrJ test code. Using Nutch with ElasticSearch is the reason I wrote this plugin. Instead of extending Nutch to support ElasticSearch as an endpoint, I figured it would be much better to support any tool trying to talk to Solr. This plugin should greatly reduce the effort in testing and/or replacing Solr with ElasticSearch. It also opens the doors for using tools that were previously not available to ElasticSearch users.

Source available on GitHub:

https://github.com/mattweber/elasticsearch-mocksolrplugin

Solr For WordPress on GitHub

I put Solr For WordPress on GitHub. This is the latest code for 0.3.0.

https://github.com/mattweber/solr-for-wordpress

Running Specific Solr Unit Tests

Just realized that as of 09/17/09 and revision 816090 of Solr you can now run specific unit tests instead of everything at once. This makes a developers life (mine) much easier because you no longer need to wait for all Solr’s tests to run just to test your particular piece of code.

To run specific testcase:

ant -Dtestcase=<CLASS NAME> junit

To run all tests for a specific package:

ant -Dtestpackage=<PACKAGE NAME> junit

To run all root tests of a specific package:

ant -Dtestpackageroot=<PACKAGE ROOT> junit

Solr for WordPress 0.2.0 Released

I just released Solr for WordPress 0.2.0. This release completely replaces the default WordPress search without any special setup. I have also added i18n support so people can translate it into different languages, integrated it into the default WordPress theme, and added support to enable or disable specific facets. This release should make it much easier for people to get setup and working correctly. As usual, please let me know of any bugs you might find by opening a report at https://bugs.launchpad.net/solr4wordpress.

Download here.

Solr AutoSuggest with TermsComponent and jQuery

I needed to implement an autosuggest/autocomplete search box for use with Solr. After a little research, I found the new TermsComponent feature in Solr 1.4. To use TermsComponent for suggestions, you need to provide set the prefix and lower bound to the input term and make the lower bound exclusive. Use the terms.fl parameter to set the source field. This means:

  • Set terms.lower to the input term
  • Set terms.prefix to the input term
  • Set terms.lower.incl to false
  • Set terms.fl to the name of the source field

Your resulting query should look something like this:

http://localhost:8983/solr/autoSuggest?terms=true&terms.fl=name&terms.lower=py&terms.prefix=py&terms.lower.incl=false&indent=true&wt=json

Note: This assumes you are using the default solrconfig.xml for Solr 1.4

In the example above I used “py” for my input term. You will then get output that looks similar to this:

{
 "terms":[
  "spell",[
	"pyblosxom",16,
	"pychm",16,
	"pyqt",16,
	"python",16]]}

Now that we have TermsComponent setup and working correctly its time to create the autosuggest/autocomplete search box. Since I am not one to reinvent the wheel, I did a quick search and found a jQuery UI plugin for autocomplete. The search frontend I was developing was already using jQuery, so this plugin was a perfect fit.

This autocomplete plugin is not in the current release of jQuery UI so I needed to grab it from their subversion repository. You can find instructions where to get it here.

The plugin supports AJAX calls for the data source. It expects the data source to return each suggestion on it’s own line, for example:

pyblosxom
pychm
pyqt
python

As you saw above, this is not what direct output from Solr looks like. On top of this, it is not a good idea to expose your backend server via your frontend code. Time to write a java servlet.

Unfortunately the java client for Solr, SolrJ, didn’t support TermsComponent yet. I decided to add this support, so please see this post for information on my patch.

Assuming you are using a version of SolrJ with my patch, here is a simple servlet that provides the functionality we need:

protected void doGet(HttpServletRequest req, HttpServletResponse res) throws ServletException, IOException {
        String q = req.getParameter("q");
        String limit = req.getParameter("limit");
	PrintWriter writer = res.getWriter();
	List<Term> terms = query(q, Integer.parseInt(limit));

	if (terms != null) {
		for (Term t : terms) {
			writer.println(t.getTerm());
		}
	}
}

And the query method:

private List<Term> query(String q, int limit) {
    List<Term> items = null;
    CommonsHttpSolrServer server = null;

     try {
         server = new CommonsHttpSolrServer("http://localhost:8983/solr");
     } catch(Exception e) { e.printStackTrace(); }

     // escape special characters
     SolrQuery query = new SolrQuery();
     query.addTermsField("spell");
     query.setTerms(true);
     query.setTermsLimit(limit);
     query.setTermsLower(q);
     query.setTermsPrefix(q);
     query.setQueryType("/terms");

     try {
         QueryResponse qr = server.query(query);
         TermsResponse resp = qr.getTermsResponse();
         items = resp.getTerms("spell");
     } catch (SolrServerException e) {
      	items = null;
     }

     return items;
}

Now you may be wondering why I used the “q” and “limit” parameters. I use these because this is what the jQuery autocomplete plugin sends to the servlet. “q” is the input term, and “limit” is the max number of suggestions to return.

Now to hook everything together. Insert the following javascript into the head of your search page and replace “#searchbox” with the id of the input box you want to use for autocompletion. Also insert the correct url to your servlet.

        	$(document).ready(function() {

        		$("#searchbox").autocomplete({ url: 'completion',
        			 max: 5,
        		});
          	});

Update your css file with required jQuery UI css:

/* Autocomplete
----------------------------------*/
.ui-autocomplete {}
.ui-autocomplete-results { overflow: hidden; z-index: 99999; padding: 1px; position: absolute; }
.ui-autocomplete-results ul { width: 100%; list-style-position: outside; list-style: none; padding: 0; margin: 0; } 

/* if  the width: 100%, a horizontal scrollbar will appear when scroll: true. */
/* !important! if line-height is not set, or is set to a relative unit, scroll will be broken in firefox */
.ui-autocomplete-results li { margin: 0px; padding: 2px 5px; cursor: default; display: block; font: menu; font-size: 12px; line-height: 16px; overflow: hidden; border-collapse: collapse; }
.ui-autocomplete-results li.ui-autocomplete-even { background-color: #fff; }
.ui-autocomplete-results li.ui-autocomplete-odd { background-color: #eee; }

.ui-autocomplete-results li.ui-autocomplete-state-default { background-color: #fff; border: 1px solid #fff; color: #212121; }
.ui-autocomplete-results li.ui-autocomplete-state-active { color: #000; background:#E6E6E6 url(images/ui-bg_glass_75_e6e6e6_1x400.png) repeat-x; border:1px solid #D3D3D3; }

.ui-autocomplete-loading { background: white url('images/ui-anim.basic.16x16.gif') right center no-repeat; }
.ui-autocomplete-over { background-color: #0A246A; color: white; }

Congratulations! You should now have a working Solr-based autocomple search box!
Solr AutoCompletion

SolrJ TermsComponent Support

I was working on implementing an auto-complete search box today using Solr 1.4 and the new TermsComponent. TermsComponent is a simple plugin that provides access to Lucene’s term dictionary and is very fast. Being fast and the fact it can hook into a search index makes it perfect for an auto-completion server.

Unfortunately, SolrJ does not support this new functionality yet. Well not officially because you could always parse the raw response object yourself. That is exactly what I was doing until I figured I might as well just add the support to SolrJ. I did, and it was extremely easy.

I added support for TermsComponent parameters and implemented a new TermsComponent response type. The TermsComponent response is parsed into a list of Type objects. The Type object has two methods, getTerm() and getFrequency(). getTerm() returns the suggested term, and getFrequency() returns the frequency of the term appearing in the index.

I have submitted my patch upstream for inclusion into a future version of SolrJ.

Here is the link to the JIRA bug report:
https://issues.apache.org/jira/browse/SOLR-1139

Here is the patch:
https://issues.apache.org/jira/secure/attachment/12407047/SOLR-1139.patch

Solr for WordPress

Solr for WordPress
Solr for WordPress is a WordPress plugin that interacts with an instance of the Solr search engine. With this plugin you can:

  • Index pages and posts
  • Perform advanced queries
  • Enable faceting on fields such as tags, categories, and author
  • Treat the category facet as a taxonomy
  • Add special template tags so you can create your own custom result pages to match your theme
  • Configuration options allow you to select pages to ignore, features to enable/disable, and what type of result information you want output.
  • Hit highlighting
  • Dynamic result teasers

Solr for WordPress requires WordPress 2.7 or greater and an instance of Solr 1.3 or greater. Installation is simple, just extract the plugin in your WordPress plugins folder, activate it, then point it at your Solr instance via the configuration page. From there, you can index all your pages and/or posts and you are ready to perform searches against your WordPress data.

This plugin assumes your Solr schema contains the following fields: id, permalink, title, content, numcomments, categories, categoriessrch, tags, tagssrch, author, type, and text. The facet fields (categories, tags, author, and type) should be string fields. You can make tagssrch and categoriessrch any type you want as they are used for general searching. The plugin is distributed with a Solr schema you can use. I will eventually package up a version of Solr configured specifically for this plugin. Until then, the provided schema will have to do.

Integrating Solr for WordPress into your theme is quite simple as well. The plugin provides two template tags, one for a search box and another for search results. For the search box, use the s4w_search_form() tag. For the search results use the s4w_search_results() tag. These template tags output valid xhtml that you can style with css.

This version of the plugin requires you to create your own search page template then create a search page called “Search” using this template. It also requires you to manually update any search forms to search against the search page you just create (“/search/”) and putting the query parameters in the “qry” parameter. In future versions it will completely replace the standard WordPress search functionality.

By default, facting is enabled for the category, tags, author, and post type. Faceting allows your user to drill down into the search results filtering on values of the particular facet. The category facet can be treated as a taxonomy as well.

UPDATE:
Released Solr for WordPress 0.2.0

Plugin Home: Solr for WordPress
Download: Solr for WordPress 0.1.0
WordPress Hosted Plugin Page: Solr for WordPress

New Design

I have finally decided to update my site. I have been pretty busy with work the last couple years and have been slacking on the site. Well, I got an itch to start working on it again and decided to kick off my renewed interest with a completely new design. The design is a heavily modified version of the Elixir theme by Michael Whalen.

Along with the new design I have integrated Solr search. Solr is an amazing search engine. I wrote a plugin called Solr for WordPress that handles all the integration between Solr and WordPress. I will be writing a post about the plugin soon.