I needed to implement an autosuggest/autocomplete search box for use with Solr. After a little research, I found the new TermsComponent feature in Solr 1.4. To use TermsComponent for suggestions, you need to provide set the prefix and lower bound to the input term and make the lower bound exclusive. Use the terms.fl parameter to set the source field. This means:
- Set terms.lower to the input term
- Set terms.prefix to the input term
- Set terms.lower.incl to false
- Set terms.fl to the name of the source field
Your resulting query should look something like this:
http://localhost:8983/solr/autoSuggest?terms=true&terms.fl=name&terms.lower=py&terms.prefix=py&terms.lower.incl=false&indent=true&wt=json
Note: This assumes you are using the default solrconfig.xml for Solr 1.4
In the example above I used “py” for my input term. You will then get output that looks similar to this:
{
"terms":[
"spell",[
"pyblosxom",16,
"pychm",16,
"pyqt",16,
"python",16]]}
Now that we have TermsComponent setup and working correctly its time to create the autosuggest/autocomplete search box. Since I am not one to reinvent the wheel, I did a quick search and found a jQuery UI plugin for autocomplete. The search frontend I was developing was already using jQuery, so this plugin was a perfect fit.
This autocomplete plugin is not in the current release of jQuery UI so I needed to grab it from their subversion repository. You can find instructions where to get it here.
The plugin supports AJAX calls for the data source. It expects the data source to return each suggestion on it’s own line, for example:
pyblosxom
pychm
pyqt
python
As you saw above, this is not what direct output from Solr looks like. On top of this, it is not a good idea to expose your backend server via your frontend code. Time to write a java servlet.
Unfortunately the java client for Solr, SolrJ, didn’t support TermsComponent yet. I decided to add this support, so please see this post for information on my patch.
Assuming you are using a version of SolrJ with my patch, here is a simple servlet that provides the functionality we need:
protected void doGet(HttpServletRequest req, HttpServletResponse res) throws ServletException, IOException {
String q = req.getParameter("q");
String limit = req.getParameter("limit");
PrintWriter writer = res.getWriter();
List<Term> terms = query(q, Integer.parseInt(limit));
if (terms != null) {
for (Term t : terms) {
writer.println(t.getTerm());
}
}
}
And the query method:
private List<Term> query(String q, int limit) {
List<Term> items = null;
CommonsHttpSolrServer server = null;
try {
server = new CommonsHttpSolrServer("http://localhost:8983/solr");
} catch(Exception e) { e.printStackTrace(); }
// escape special characters
SolrQuery query = new SolrQuery();
query.addTermsField("spell");
query.setTerms(true);
query.setTermsLimit(limit);
query.setTermsLower(q);
query.setTermsPrefix(q);
query.setQueryType("/terms");
try {
QueryResponse qr = server.query(query);
TermsResponse resp = qr.getTermsResponse();
items = resp.getTerms("spell");
} catch (SolrServerException e) {
items = null;
}
return items;
}
Now you may be wondering why I used the “q” and “limit” parameters. I use these because this is what the jQuery autocomplete plugin sends to the servlet. “q” is the input term, and “limit” is the max number of suggestions to return.
Now to hook everything together. Insert the following javascript into the head of your search page and replace “#searchbox” with the id of the input box you want to use for autocompletion. Also insert the correct url to your servlet.
$(document).ready(function() {
$("#searchbox").autocomplete({ url: 'completion',
max: 5,
});
});
Update your css file with required jQuery UI css:
/* Autocomplete
----------------------------------*/
.ui-autocomplete {}
.ui-autocomplete-results { overflow: hidden; z-index: 99999; padding: 1px; position: absolute; }
.ui-autocomplete-results ul { width: 100%; list-style-position: outside; list-style: none; padding: 0; margin: 0; }
/* if the width: 100%, a horizontal scrollbar will appear when scroll: true. */
/* !important! if line-height is not set, or is set to a relative unit, scroll will be broken in firefox */
.ui-autocomplete-results li { margin: 0px; padding: 2px 5px; cursor: default; display: block; font: menu; font-size: 12px; line-height: 16px; overflow: hidden; border-collapse: collapse; }
.ui-autocomplete-results li.ui-autocomplete-even { background-color: #fff; }
.ui-autocomplete-results li.ui-autocomplete-odd { background-color: #eee; }
.ui-autocomplete-results li.ui-autocomplete-state-default { background-color: #fff; border: 1px solid #fff; color: #212121; }
.ui-autocomplete-results li.ui-autocomplete-state-active { color: #000; background:#E6E6E6 url(images/ui-bg_glass_75_e6e6e6_1x400.png) repeat-x; border:1px solid #D3D3D3; }
.ui-autocomplete-loading { background: white url('images/ui-anim.basic.16x16.gif') right center no-repeat; }
.ui-autocomplete-over { background-color: #0A246A; color: white; }
Congratulations! You should now have a working Solr-based autocomple search box!

Comment Spam
Within a week of switching to WordPress for my blogging software, I started receiving a lot of comment spam. I found this amazing because I have had a blog for a few years now without any problems. I have had the occasional spam comment, but lately I have been receiving 3-7 of them a day. I know this is very little compared to high-volume sites, but seems like a lot for a small site like mine. For the most part, the Akismet spam plugin WordPress ships with does an amazing job. It has let a few slip by, but that is no big deal.
This whole comment spam problem reminded me of a research paper I read a year or so ago. It was called Defending Against an Internet-based Attack on the Physical World. It was about the threat of using api’s such as Google’s SOAP API to automate filling out request forms for catalogues and other material on thousands of sites to some victim. This would cause the victim’s physical mail to become overloaded and very hard to manage. Imagine 100′s or 1,000′s of pieces of mail being delivered to your house every day. The point of this being that I figure spammers are using a technique similar to this to find WordPress blogs, then spam them automatically.
I decided to see how easy it was. First I went to see if I could sign up for Google’s SOAP API, but I found out that they no longer offer this service. Without this service, it is going to be a lot harder to get this done. Ignoring the whole api problem, I decided to find a search string to find comment pages on WordPress blogs. I was amazed at how easy this was. I just went to a blog using the default WordPress theme and looked for keywords that would always be there. After about a second I came up with this search string:
Typing this into google found over 1,000,000 pages! Clicking a few of these verified that they were infact WordPress comment pages. Now I needed to write a program to automate parsing these links. Without the search api, I was stuck doing it manually. After about an hour I came up with this python script. This script will submit the search string I generated above to google, parse the first 100 results from the page, then submit a search for the next 100 and so on. While testing this script I noticed google started blocking my search, which is a good thing. I found a way around this by using different User-Agent strings and adding some timeouts. Because of this, the script defaults to saving the first 100 links. I have left out the code to fill out the comment forms becuase I feel that piece of code would do more harm than good.
Anyways, I think there is a huge problem with comment spam that needs to be fixed. The fact that so many pages can be found in a single search is amazing. Google blocking querys when it detects a bot is definitely a step in the right direction. The fact that I was able to get around this so easily is not.
Files:
http://www.mattweber.org/files/wp-link-finder.py