Wednesday, March 27, 2013

Apache Lucene and Apache Tika

Apache Lucene Core: Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It suits nearly any application that requires full-text search, especially cross-platform applications.

Apache Tika: The Apache Tika toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries.

Using Tika, we will parse a file in any format it supports and extract the text from it. We then pass the extracted text to Lucene, which in turn indexes it and makes it ready for searching.
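
To make the flow concrete, here is a minimal, dependency-free sketch of the extract, index, search pipeline. It does not use Tika or Lucene themselves: the hypothetical `extractText` method stands in for Tika's text extraction (it simply decodes UTF-8 bytes), and the hypothetical `TinyIndex` class stands in for a Lucene index (a plain in-memory map from each word to the documents that contain it). All names in it are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class ExtractIndexSearchSketch {

    // Stand-in for Tika (assumption): just decodes the bytes as UTF-8 text.
    // Tika would instead detect the file type and run a matching parser.
    static String extractText(byte[] content) {
        return new String(content, StandardCharsets.UTF_8);
    }

    // Stand-in for a Lucene index (assumption): a toy inverted index
    // mapping each lowercased word to the names of documents containing it.
    static final class TinyIndex {
        private final Map<String, Set<String>> postings = new HashMap<>();

        // Split the text into words and record the document under each word.
        void add(String docName, String text) {
            for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
                if (!token.isEmpty()) {
                    postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docName);
                }
            }
        }

        // Look up a single word, case-insensitively.
        Set<String> search(String term) {
            Set<String> hits = postings.get(term.toLowerCase(Locale.ROOT));
            return hits == null ? Collections.emptySet() : Collections.unmodifiableSet(hits);
        }
    }

    public static void main(String[] args) {
        TinyIndex index = new TinyIndex();

        // Step 1: extract text. Step 2: hand the text to the index.
        index.add("notes.txt",
                extractText("Lucene builds the index".getBytes(StandardCharsets.UTF_8)));
        index.add("readme.txt",
                extractText("Tika extracts text; Lucene searches it".getBytes(StandardCharsets.UTF_8)));

        // Step 3: search the index.
        System.out.println(index.search("lucene")); // prints [notes.txt, readme.txt]
        System.out.println(index.search("tika"));   // prints [readme.txt]
    }
}
```

In a real application, Tika would replace `extractText` and Lucene would replace `TinyIndex`. Lucene also adds text analysis, relevance ranking, and on-disk storage that this toy index leaves out.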