Sent to you by jeffye via Google Reader:

 
 

via LingPipe Blog by mitzimorris on 2/18/09

This is a tutorial on getting started with the Lucene 2.4 Java API which avoids using deprecated classes and methods. In particular it shows how to process search results without using Hits objects, as this class is scheduled for removal in Lucene 3.0. The time estimate of 60 seconds to complete this tutorial is more wish than promise; my goal is to present the essential concepts in Lucene as concisely as possible.

In the best “Hello World” tradition I have written a class called HelloLucene which builds a Lucene index over a set of 3 texts: { “hello world”, “hello sailor”, “goodnight moon” }. HelloLucene has two methods (besides main): buildIndex and searchIndex, which are called in turn. buildIndex builds an index over these 3 texts; searchIndex takes the args vector and runs a search over the index for each string, printing out results to the terminal.

Here is buildIndex and its required import statements:

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
...
public static void buildIndex() throws IOException {
    IndexWriter indexWriter
        = new IndexWriter(FSDirectory.getDirectory("helloLuceneIndex"),
                          new StandardAnalyzer(),
                          IndexWriter.MaxFieldLength.LIMITED);
    String[] texts = new String[] {  "hello world",
                                     "hello sailor",
                                     "goodnight moon" };
    for (String text : texts) {
        Document doc = new Document();
        doc.add(new Field("text",text,
                    Field.Store.YES,Field.Index.ANALYZED));
        indexWriter.addDocument(doc);
    }
    indexWriter.close();
}

A Lucene data store is called an Index. It is a collection of Document objects. A Document consists of one or more named Field objects. Documents are indexed on a per-field basis. An IndexWriter object is used to create this index. Let’s go over the call to its constructor method:

new IndexWriter(FSDirectory.getDirectory("helloLuceneIndex"),
                new StandardAnalyzer(),
                IndexWriter.MaxFieldLength.LIMITED);

The first argument is the index that the IndexWriter operates on. This example keeps the index on disk, so the FSDirectory class is used to get a handle to this index. The second argument is a Lucene Analyzer. This object controls how the input text is tokenized into terms used in search and indexing. HelloLucene uses a StandardAnalyzer, which is designed to index English language texts. It ignores punctuation, removes common function words such as "if", "and", and “"but", and converts words to lowercase. The third argument determines the amount of text that is indexed. The constant IndexWriter.MaxFieldLength.LIMITED defaults to 10,000 characters.

for (String text : texts) {
    Document doc = new Document();
    doc.add(new Field("text",text,Field.Store.YES,Field.Index.ANALYZED));
    indexWriter.addDocument(doc);
}
indexWriter.close();

The for loop maps texts into Document objects, which contain a single Field with name "text". The last 2 arguments to the Field constructor method specify that the contents of the field are stored in the index, and that they are analyzed by the IndexWriter's Analyzer. The IndexWriter.addDocument() method adds each document to the index. After all texts have been processed the IndexWriter is closed.

Both indexing and search are operations over Document objects. Searches over the index are specified on a per-field basis, (just like indexing). Lucene computes a similarity score between each search query and all documents in the index and the search results consist of the set of best-scoring documents for that query.

Here is searchIndex and its required import statements:

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
...
public static void searchIndex(String[] queryStrings)
    throws IOException, ParseException {
    Searcher searcher
        = new IndexSearcher(FSDirectory.getDirectory("helloLuceneIndex"));
    QueryParser parser = new QueryParser("text",new StandardAnalyzer());
    for (String queryString : queryStrings) {
        System.out.println("nsearching for: " + queryString);
        Query query = parser.parse(queryString);
        TopDocs results = searcher.search(query,10);
        System.out.println("total hits: " + results.totalHits);
        ScoreDoc[] hits = results.scoreDocs;
        for (ScoreDoc hit : hits) {
            Document doc = searcher.doc(hit.doc);
            System.out.printf("%5.3f %sn",
                              hit.score, doc.get("text"));
        }
    }
    searcher.close();
}

Access to a Lucene index (or indexes) is provided by a Searcher object. searchIndex instantiates a IndexSearcher over the directory "helloLuceneIndex". Search happens via a call to the Searcher.search() method. Before calling the search() method the search string must be processed into a Lucene Query. To prepare the query we create a QueryParser and call its parse() method on the search string. The QueryParser is constructed using the same Analyzer as was used for indexing because the QueryParser processes the search string into a set of search terms which are matched against the terms in the index, therefore both the indexed text and the search text must be tokenized in the same way.

Here are the statements which process the results:

TopDocs results = searcher.search(query,10);
System.out.println("total hits: " + results.totalHits);
ScoreDoc[] hits = results.scoreDocs;
for (ScoreDoc hit : hits) {
    Document doc = searcher.doc(hit.doc);
    System.out.printf("%5.3f %sn",
                      hit.score, doc.get("text"));
}

The search() method returns a TopDocs object, which has 2 public fields: scoreDocs and totalHits. The latter is the number of documents that matched the query, and the former is the array of results in the form of ScoreDoc objects, where a ScoreDoc is itself a simple object consisting of two public fields: doc, the Document id (an int value); and score a float value calculated by Lucene (using a similarity metric best described as Unexplainable Greek kung fu). searchIndex reports the TopDoc.totalHits, then iterates over the TopDoc.scoreDocs array. The ScoreDoc.doc is used to retrieve the Document object from the index, and finally the method Document.get() retrieves the contents of the Field with name "text".

In order to compile this program, download the Lucene 2.4 distribution, which contains the jarfile lucene-core-2.4.0.jar. Either add this to your classpath, or simply put it in the same directory as HelloLucene, and compile with the following command:

javac -cp "lucene-core-2.4.0.jar" HelloLucene.java

Invoke the program with the following command:

java -cp ".;lucene-core-2.4.0.jar" HelloLucene "hello world" hello moon foo

Produces these results:

built indexsearching for: hello world
total hits: 2
1.078 hello world
0.181 hello sailorsearching for: hello
total hits: 2
0.625 hello world
0.625 hello sailorsearching for: moon
total hits: 1
0.878 goodnight moonsearching for: foo
total hits: 0all done

The first query "hello world" matches exactly against our first text, but it also partially matches the text "hello sailor". The second query "hello" matches both texts equally well. The third query "moon" matches against the only document which contains that word. The fourth query "foo" doesn’t match anything in the index.

 
 

Things you can do from here:

 
 

 Leave a Reply

(required)

(required)


*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>

使用腾讯微博登陆

Protected by WP Anti Spam
   
© 2011 Information Retrieval Blog Suffusion theme by Sayontan Sinha