Senior Software Engineer, Craig Handley, talks about Apache Lucene and using it to filter results.

Filtering using Apache Lucene

Craig Handley -  27 Mar, 2016

The Travel Engineering team here at travel.cloud recently started work on filtering hotel search results for a brand new version of our online booking tool. Server-side filtering is a common requirement these days. We serve paginated search results to our thin client, so the client never holds the full result set and cannot filter it itself. We therefore needed to be able to filter and re-filter the search results quickly on the server by attributes such as price, distance, name and star rating. Following discussions between the UI and server-side developers, we decided that, rather than adding a series of query parameters to carry the filter criteria, it would be better to use something that the UI already understood: Lucene query syntax.
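To make that concrete, here is a minimal sketch of how a whole filter could travel in a single, URL-encoded request parameter. The `/hotels` path, the `filter` parameter name and the field names are assumptions made for illustration; they are not our actual API.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper: the path, the "filter" parameter name and the field
// names are illustrative assumptions, not the real booking tool API.
final class FilterUrlSketch {

    // URL-encodes a Lucene query string so that it fits in one query parameter.
    static String withFilter(final String path, final String luceneQuery) {
        try {
            return path + "?filter=" + URLEncoder.encode(luceneQuery, "UTF-8");
        } catch (final UnsupportedEncodingException e) {
            // UTF-8 is always available on the JVM, so this should never happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(final String[] args) {
        System.out.println(withFilter("/hotels", "price:[25.00 TO 150.00] AND name:grand"));
    }
}
```

One parameter can carry any combination of criteria, so adding a new filterable attribute would not need a new request parameter.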

Once we’d decided upon the syntax that would be used to carry the filter criteria, the server side development team needed to build some functionality that could parse the Lucene query and reduce the recordset returned by applying the filter criteria.

As the team started looking into ways to parse the Lucene query, Apache Lucene was identified as a possible solution that could not only parse the query, but also filter the results for us. Apache Lucene is a Java-based, open-source information retrieval library that allows documents to be indexed and searched very efficiently. The real advantage of using Apache Lucene is that its index can be held entirely in memory, which suits our filtering requirements superbly.

To use Apache Lucene, simply add the required jars as dependencies in your build.gradle file:

compile 'org.apache.lucene:lucene-core:4.0.0'
compile 'org.apache.lucene:lucene-analyzers-common:4.0.0'
compile 'org.apache.lucene:lucene-queryparser:4.0.0'

The code to implement indexing and filtering is then straightforward. A complete filtering example class is included below, with comments to highlight the important points. It indexes a List of POJOs and allows them to be filtered.

This is the class that we want to filter, a simple POJO that contains an id, a name and a price:

public class Pojo {

    private String id;
    private String name;
    private Double price;

    public String getId() {
        return id;
    }

    public void setId(final String id) {
        this.id = id;
    }

    public String getName() {
        return name;
    }

    public void setName(final String name) {
        this.name = name;
    }

    public Double getPrice() {
        return price;
    }

    public void setPrice(final Double price) {
        this.price = price;
    }
}

This is the code that indexes the List of Pojos and allows the filtering to occur:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.ListIterator;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.*;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class TestSearchFilter {

The applyFilters() method is the one to call when you have a list of objects that you want to index and filter. You simply pass it the collection that contains the unfiltered list and the Lucene query that you want to apply. Note that the list is filtered in place, so it must be modifiable.

    public List<Pojo> applyFilters(final List<Pojo> results, final String filters)
            throws IOException, ParseException {
        final StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
        Directory index = null;

        try {
            // The index lives purely in memory and is discarded once filtering is done.
            index = new RAMDirectory();
            indexResults(analyzer, index, results);
            // Removes the non-matching entries from 'results' in place.
            filterResults(analyzer, index, results, filters);
        } finally {
            if (index != null) {
                index.close();
            }
        }

        return results;
    }

The indexResults() method is what actually creates the indexed documents to allow the filtering to occur. The code just passes each of the objects in your unfiltered collection to the addDoc() method.

    private void indexResults(final StandardAnalyzer analyzer, final Directory index, final List<Pojo> results)
            throws IOException {
        IndexWriter w = null;

        try {
            final IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
            w = new IndexWriter(index, config);

            // One Lucene document per object in the unfiltered list.
            for (final Pojo pojo : results) {
                addDoc(w, pojo);
            }
        } finally {
            // Closing the writer commits the documents so that a reader can see them.
            if (w != null) {
                w.close();
            }
        }
    }

Each of the attributes that you want to index (as well as the key that you want to retrieve) appears within the addDoc() method. The only attribute that you actually need to store (using Field.Store.YES) is the id that you’ll need to retrieve; all other attributes that you want to index (and therefore filter by) should use Field.Store.NO. For a String value that must match exactly, such as the id, use StringField. For free text that should be analysed into searchable words, such as the name, use TextField. For a numeric value, use DoubleField.

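    private void addDoc(final IndexWriter w, final Pojo result) throws IOException {
        final Document doc = new Document();
        // Stored, because the id is read back from each hit; indexed as a single exact term.
        doc.add(new StringField("id", result.getId(), Field.Store.YES));
        // Analysed into words so that the name can be searched; not stored.
        doc.add(new TextField("name", result.getName(), Field.Store.NO));
        // Indexed numerically so that range queries work; getPrice() must not return null here.
        doc.add(new DoubleField("price", result.getPrice(), Field.Store.NO));

        w.addDocument(doc);
    }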
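The difference between the two text field types is easier to see with an example. The sketch below is not Lucene code: it is a deliberately simplified stand-in that assumes analysis amounts to splitting on non-alphanumeric characters and lower-casing. The real StandardAnalyzer does more than this (it also removes stop words, for instance), so treat it only as an illustration of single-term versus analysed indexing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Simplified stand-in for Lucene's field handling, for illustration only.
final class FieldTermsSketch {

    // StringField-like behaviour: the whole value becomes one term, unchanged.
    static List<String> asSingleTerm(final String value) {
        return Arrays.asList(value);
    }

    // TextField-like behaviour (approximation): split into words and lower-case them.
    static List<String> asAnalyzedTerms(final String value) {
        final List<String> terms = new ArrayList<String>();
        for (final String token : value.split("[^\\p{Alnum}]+")) {
            if (!token.isEmpty()) {
                terms.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    public static void main(final String[] args) {
        System.out.println(asSingleTerm("Grand Hotel, York"));
        System.out.println(asAnalyzedTerms("Grand Hotel, York"));
    }
}
```

This is why a filter such as name:grand can find a hotel called "Grand Hotel, York", whereas the id only matches when the whole value is supplied exactly.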

The filterResults() method applies the Lucene query to your collection of objects. It builds up a list of the ids that match the Lucene query and then simply removes any objects from the unfiltered collection whose id is not in that list.

    private List<Pojo> filterResults(final StandardAnalyzer analyzer, final Directory index,
            final List<Pojo> results, final String filters) throws IOException, ParseException {
        DirectoryReader reader = null;

        try {
            // "name" is the default field for query terms that do not name a field.
            final Query q = new PojoQueryParser(Version.LUCENE_40, "name", analyzer).parse(filters);

            reader = DirectoryReader.open(index);
            final IndexSearcher searcher = new IndexSearcher(reader);
            // Ask for up to one hit per indexed object; the search needs a limit of at least 1.
            final int maxHits = Math.max(1, results.size());
            final ScoreDoc[] hits = searcher.search(q, null, maxHits).scoreDocs;
            final List<String> hitsList = new ArrayList<String>();

            // Collect the stored id of every matching document.
            for (int i = 0; i < hits.length; i++) {
                hitsList.add(searcher.doc(hits[i].doc).get("id"));
            }

            final ListIterator<Pojo> iter = results.listIterator();

            // Remove every object whose id was not among the hits.
            while (iter.hasNext()) {
                if (!hitsList.contains(iter.next().getId())) {
                    iter.remove();
                }
            }
        } finally {
            if (reader != null) {
                reader.close();
            }
        }

        return results;
    }
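The removal step does not depend on Lucene at all, so it can be sketched on its own. The version below is a possible variation rather than the code we run: it swaps the List and ListIterator for a HashSet and removeIf (Java 8 or later), which avoids scanning the list of ids once per object. The Item class is a hypothetical stand-in for the Pojo above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of the "keep only matching ids" step, independent of Lucene.
final class RetainByIdSketch {

    // Hypothetical stand-in for the Pojo class; only the id matters here.
    static final class Item {
        final String id;

        Item(final String id) {
            this.id = id;
        }
    }

    // Removes, in place, every item whose id is not in the set of matching ids.
    static void retainMatching(final List<Item> results, final Set<String> matchingIds) {
        results.removeIf(item -> !matchingIds.contains(item.id));
    }

    public static void main(final String[] args) {
        final List<Item> results = new ArrayList<Item>(
                Arrays.asList(new Item("a"), new Item("b"), new Item("c")));
        retainMatching(results, new HashSet<String>(Arrays.asList("a", "c")));
        for (final Item item : results) {
            System.out.println(item.id);
        }
    }
}
```

As in filterResults(), the original ordering of the surviving objects is preserved, which matters when the results were already sorted before filtering.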




The final piece is our own subclass of QueryParser. The default QueryParser builds range queries that compare terms as Strings, so you have to supply your own handling for range queries on fields that are not String based. You can see below how we handle a Lucene query that filters on the non-String attribute ‘price’, e.g. “price:[25.00 TO 150.00]”.

    class PojoQueryParser extends QueryParser {

        public PojoQueryParser(final Version matchVersion, final String f, final Analyzer a) {
            super(matchVersion, f, a);
        }

        @Override
        protected org.apache.lucene.search.Query getRangeQuery(final String field, final String start,
                final String end, final boolean startInclusive, final boolean endInclusive) throws ParseException {
            if ("price".equals(field)) {
                // price is indexed as a DoubleField, so it needs a numeric range query.
                return NumericRangeQuery.newDoubleRange(field, Double.valueOf(start), Double.valueOf(end),
                        startInclusive, endInclusive);
            }

            // Every other field keeps the default String-based range handling.
            return super.getRangeQuery(field, start, end, startInclusive, endInclusive);
        }
    }
}
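To see why a String-based range is the wrong tool for prices, compare text ordering with numeric ordering for the example range above. This sketch only illustrates the ordering problem in plain Java; it makes no attempt to reproduce how Lucene actually encodes numeric terms in the index.

```java
// Illustrates why "price:[25.00 TO 150.00]" needs numeric comparison.
final class RangeComparisonSketch {

    // Term-style check: bounds are compared as text, character by character.
    static boolean inTextRange(final String value, final String lower, final String upper) {
        return value.compareTo(lower) >= 0 && value.compareTo(upper) <= 0;
    }

    // Numeric check: bounds and value are parsed to doubles first.
    static boolean inNumericRange(final String value, final String lower, final String upper) {
        final double v = Double.parseDouble(value);
        return v >= Double.parseDouble(lower) && v <= Double.parseDouble(upper);
    }

    public static void main(final String[] args) {
        System.out.println(inTextRange("90.00", "25.00", "150.00"));
        System.out.println(inNumericRange("90.00", "25.00", "150.00"));
    }
}
```

Compared as text, "150.00" sorts before "25.00" because '1' comes before '2', so the text range is empty and a price of 90.00 is wrongly excluded. Compared as numbers, 90.00 falls inside the range as expected.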



This is a quick overview of how the Travel Engineering team has implemented filtering using Apache Lucene. It’s not a finished product; I’m sure we’ll improve the functionality over time and most likely include it directly within the Cheddar framework, but it’s a very cool way to handle a common requirement with great performance.
