Lucene is a high-performance, scalable information retrieval (IR) library. IR refers to the process of searching for documents, information within documents, or metadata about documents. Lucene lets you add searching capabilities to your applications. It’s a mature, free, open source project implemented in Java, and a project in the Apache Software Foundation, licensed under the liberal Apache Software License.
The explosion of the internet and digital repositories has brought large amounts of information within our reach. Over time, the amount of available data has become so vast that we need alternative, more dynamic ways of finding information. The need to quickly locate specific information in this sea of data isn’t limited to the internet realm: desktop computers store increasingly more data on multi-terabyte hard drives.
Now let’s jump straight into an example to get started with Lucene:
package com.dev.kunal.lucenesimple.starter;

import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class SampleExample {

    // Adds one document with two stored fields. StringField indexes the whole
    // value as a single token, so a field matches only on its exact value.
    private static void addDoc(IndexWriter writer, String title, String isbn) throws IOException {
        Document doc = new Document();
        doc.add(new StringField("title", title, Store.YES));
        doc.add(new StringField("isbn", isbn, Store.YES));
        writer.addDocument(doc);
    }

    public static void main(String[] args) {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        // In-memory index. Newer Lucene releases deprecate RAMDirectory
        // in favor of ByteBuffersDirectory.
        Directory directory = new RAMDirectory();
        try {
            // Indexing work
            IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);
            try (IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig)) {
                addDoc(indexWriter, "kunal", "1234567");
                addDoc(indexWriter, "kumar", "1234568");
                addDoc(indexWriter, "kumal", "1234569");
                addDoc(indexWriter, "kamal", "1234570");
            } // closing the writer commits the documents to the index

            // Searching work
            // None of the sample titles is "lucene", so the default query finds
            // no hits; pass a title such as kunal as the first argument to see a match.
            String queryString = args.length > 0 ? args[0] : "lucene";
            Query q = new QueryParser("title", analyzer).parse(queryString);
            try (IndexReader reader = DirectoryReader.open(directory)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                TopDocs docs = searcher.search(q, 10); // at most the 10 best hits
                ScoreDoc[] hits = docs.scoreDocs;
                System.out.println("Found hit count " + hits.length);
                for (int i = 0; i < hits.length; i++) {
                    int id = hits[i].doc;
                    Document d = searcher.doc(id); // stored fields are loaded here
                    System.out.println((i + 1) + ". " + d.get("isbn") + "\t" + d.get("title"));
                }
            }
        } catch (IOException | ParseException e) {
            // IOException also covers CorruptIndexException and LockObtainFailedException.
            e.printStackTrace();
        }
    }
}
Important Classes:
Document:
Once you have the raw content that needs to be indexed, you must translate it into the units (usually called documents) used by the search engine. A document typically consists of several separately named fields (represented by the Field class) with values, such as title, body, abstract, author, and url. You’ll have to carefully design how to divide the raw content into documents and fields, as well as how to compute the value of each of those fields. Lucene provides an API for building fields and documents, but it doesn’t provide any logic to build a document because that’s entirely application specific.
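As a rough illustration of that application-specific step, the sketch below turns one raw record into named field values using plain Java. The pipe-delimited "isbn|title" record format and the names RawRecordMapper and toFields are assumptions made for this example and are not part of Lucene. In a real indexer, each entry would become a Lucene Field added to a Document, as addDoc does in the example above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: maps a raw "isbn|title" record to named field values.
final class RawRecordMapper {

    public static Map<String, String> toFields(String rawRecord) {
        // Limit -1 keeps trailing empty parts so malformed input is detected.
        String[] parts = rawRecord.split("\\|", -1);
        if (parts.length != 2) {
            throw new IllegalArgumentException("Expected isbn|title but got: " + rawRecord);
        }
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("isbn", parts[0].trim());
        fields.put("title", parts[1].trim());
        return fields;
    }

    public static void main(String[] args) {
        Map<String, String> fields = toFields("1234567 | kunal");
        fields.forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

Keeping this mapping in one place makes it easier to change later which parts of the raw content end up in which field.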
StandardAnalyzer:
No search engine indexes text directly; rather, the text must be broken into a series of individual atomic elements called tokens. This is what happens during the Analyze Document step. Each token corresponds roughly to a “word” in the language. This step determines how the textual fields in the document are divided into a series of tokens, and it may also remove stop words. Lucene provides an array of built-in analyzers that give you fine control over this process. It’s also straightforward to build your own analyzer, or to create arbitrary analyzer chains combining Lucene’s tokenizers and token filters, to customize how tokens are created.
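To make the idea of analysis concrete, here is a minimal plain-Java sketch of what an analyzer does conceptually: split text into tokens, lowercase them, and drop stop words. It is a toy model, not StandardAnalyzer's actual implementation. The class name ToyAnalyzer and the small stop-word list are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy analyzer for illustration only; real Lucene analyzers are far more careful.
final class ToyAnalyzer {

    // A tiny stop-word list chosen for this example.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "the", "of", "to"));

    // Splits on anything that is not a letter or digit, lowercases each
    // token, and skips stop words.
    public static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            if (raw.isEmpty()) {
                continue; // a leading delimiter yields one empty string
            }
            String token = raw.toLowerCase(Locale.ROOT);
            if (!STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [art, searching, indexing]
        System.out.println(analyze("The Art of Searching, and Indexing!"));
    }
}
```

The same analysis is normally applied to query text as well, which is why the example above passes the same analyzer to both IndexWriterConfig and QueryParser.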
IndexWriter and IndexWriterConfig are the classes used for indexing.
Searching returns hits in the form of a TopDocs object.
The example then prints details of the search: the number of hits found, followed by the ISBN and title of each hit.
Note that the TopDocs object contains only references to the underlying documents. In other words, instead of being loaded immediately upon search, matches are loaded from the index lazily, only when requested with the IndexSearcher.doc(int) call. That call returns a Document object from which we can then retrieve individual field values.
When you’re querying a Lucene index, a TopDocs instance, containing an ordered
array of ScoreDoc, is returned. The array is ordered by score by default. Lucene computes
a score (a numeric value of relevance) for each document, given a query. The
ScoreDocs themselves aren’t the actual matching documents, but rather references, via
an integer document ID, to the documents matched.
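The relationship between hits and documents can be sketched with a toy model in plain Java. This is not Lucene's implementation: the class ToyHits, its methods, and the scores are hypothetical and exist only to show that a search yields (document ID, score) pairs ordered by descending score, and that a document's stored values are fetched only when asked for.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy stand-in for an index; scores are supplied by the caller, not computed.
final class ToyHits {

    // Like ScoreDoc: a reference to a document plus its score, not the document.
    public static final class Hit {
        public final int doc;
        public final float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    private final Map<Integer, String> storedTitles = new HashMap<>();
    private int loads = 0; // how many documents were actually fetched

    public void store(int docId, String title) { storedTitles.put(docId, title); }

    // Returns at most n hits, best score first, without loading any document.
    public List<Hit> search(Map<Integer, Float> scoresByDoc, int n) {
        List<Hit> hits = new ArrayList<>();
        for (Map.Entry<Integer, Float> e : scoresByDoc.entrySet()) {
            hits.add(new Hit(e.getKey(), e.getValue()));
        }
        hits.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return new ArrayList<>(hits.subList(0, Math.min(n, hits.size())));
    }

    // Loads the stored value on demand, in the spirit of IndexSearcher.doc(int).
    public String doc(int docId) { loads++; return storedTitles.get(docId); }

    public int loads() { return loads; }

    public static void main(String[] args) {
        ToyHits index = new ToyHits();
        index.store(0, "kunal");
        index.store(1, "kumar");
        index.store(2, "kamal");
        Map<Integer, Float> scores = new HashMap<>(); // made-up scores
        scores.put(0, 0.4f);
        scores.put(1, 1.3f);
        scores.put(2, 0.9f);
        for (Hit h : index.search(scores, 2)) {
            System.out.println(h.doc + "\t" + h.score + "\t" + index.doc(h.doc));
        }
    }
}
```

Because only the requested documents are loaded, asking for the top 10 hits stays cheap even when the index holds many matching documents.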
I hope this first step helps you get started with Lucene.
I will follow up with a more advanced write-up.
Suggestions for improvement are welcome.