Lucene Near Realtime Search
Jason Rutherglen & Jake Mannix, LinkedIn
6/3/2009, SOLR/Lucene User’s Group, San Francisco

What is NRT?
• Search on documents nearly as fast as they are indexed
• Delete documents in a way that is immediate and IO efficient
• Good for things like Twitter and other apps that require realtime searching (Social 2.0)

Today?
• Users expect to search their data immediately after updating it (Web/Social 2.0 apps)
• Search engines are designed to perform efficient batch indexing (not realtime)
• Batch indexing is slow and updates take a while to be searchable

NRT in Lucene
• Uses core Lucene code to make existing batch indexing nearly realtime
• Required retrofitting of some of the core implementation
• Details are hidden
• Hopefully really easy for developers to use

Lucene NRT Patches
• LUCENE-1314 – IndexReader.clone
• LUCENE-1516 – IndexWriter.getReader
• LUCENE-1313 – RAMDir in IndexWriter
• LUCENE-1483 – Fast FieldCache loading
• LUCENE-1231 – Column stride fields
• LUCENE-1526 – Incremental copy-on-write

LUCENE-1314
• IndexReader.clone is like reopen
• However it performs a copy-on-write of norms and deletes
• Used by LUCENE-1516 to keep deletes in RAM (rather than flush them to disk)
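The copy-on-write idea on this slide can be sketched without Lucene. The following is a minimal illustration, not the LUCENE-1314 patch: the class `CowDeletes` and its method names are hypothetical, and a `java.util.BitSet` stands in for a reader's deleted-docs bits. A clone shares the parent's bits and copies them only when one side records a new delete.

```java
import java.util.BitSet;

/** Hypothetical stand-in for a reader's deleted-docs set; not Lucene code. */
final class CowDeletes {
    private BitSet bits;      // may be shared with other clones
    private boolean shared;   // true while another instance may reference `bits`

    CowDeletes(int maxDoc) {
        this.bits = new BitSet(maxDoc);
        this.shared = false;
    }

    private CowDeletes(BitSet sharedBits) {
        this.bits = sharedBits;
        this.shared = true;
    }

    /** Cheap clone: no bits are copied here. */
    CowDeletes cloneShared() {
        this.shared = true;   // the parent must also copy before its next write
        return new CowDeletes(bits);
    }

    /** Marks a document deleted, copying the bit set first if it is shared. */
    void delete(int docId) {
        if (shared) {
            bits = (BitSet) bits.clone();
            shared = false;
        }
        bits.set(docId);
    }

    boolean isDeleted(int docId) {
        return bits.get(docId);
    }
}
```

In this sketch the flag is conservative: once cloned, each side copies on its first write even if the other side has already taken its own copy. That costs one extra copy at most and keeps the example free of reference counting.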

LUCENE-1516
• Adds ability to obtain an IndexReader from IndexWriter
• Efficient in-RAM deletes
• Call IndexWriter.getReader instead of IndexReader.reopen
• All updating, deleting, reopening, and flushing details hidden from user
• Will be in Lucene 2.9

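Sample IW.getReader Code
IndexWriter writer;  // assumed to be an already-open IndexWriter
Document doc = new Document();
doc.add(new Field("id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED));
writer.addDocument(doc);
IndexReader reader = writer.getReader();  // sees the document without a commit
Document sameDoc = reader.document(0);
// Document does not override equals(), so compare a stored field instead
assert "1".equals(sameDoc.get("id"));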

LUCENE-1313
• Near Realtime Search
• Makes IW.getReader faster
• New segments are flushed to IndexWriter internal RAMDirectory
• Could increase overall indexing performance because there’s no pause while the RAM buffer is being written to disk
• Will be in Lucene 2.9?

LUCENE-1483
• Searches on field caches at the segment level
• Means faster field cache loading and more efficient memory usage
• Good for realtime because field cache loading is less of a bottleneck, less RAM usage
• Will be in Lucene 2.9
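Why segment-level caching helps after a reopen can be shown with a small sketch. This is an illustration under assumptions, not Lucene's FieldCache: `SegmentFieldCache` is a hypothetical class, segment names stand in for segment readers, and the loader function stands in for reading field values from the index. Because entries are keyed per segment, a reopened reader that adds one new segment triggers one load instead of reloading everything.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical per-segment cache; segment names stand in for segment readers. */
final class SegmentFieldCache {
    private final Map<String, int[]> cache = new HashMap<>();
    private final Function<String, int[]> loader;  // assumed: loads one segment's values
    private int loads = 0;

    SegmentFieldCache(Function<String, int[]> loader) {
        this.loader = loader;
    }

    /** Returns the values for one segment, loading them on first use only. */
    int[] get(String segment) {
        int[] values = cache.get(segment);
        if (values == null) {
            values = loader.apply(segment);
            cache.put(segment, values);
            loads++;
        }
        return values;
    }

    /** Warms the cache for every segment of a (re)opened reader. */
    void warm(List<String> segments) {
        for (String segment : segments) {
            get(segment);
        }
    }

    /** Number of times the loader actually ran. */
    int loadCount() {
        return loads;
    }
}
```

Warming with segments `_0` and `_1`, then again with `_0`, `_1`, and `_2`, runs the loader three times in total rather than five. A cache keyed on the whole top-level reader would have rebuilt everything on the second call.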

LUCENE-1526
• Optimize copy-on-write
• When we’re doing IndexReader.clone, we may be creating a huge new array for a small number of deletes or norms updates
• So we need to do incremental copy-on-write of things like deletes, norms, and field caches (?)
• Lucene 3.0?
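One way to picture incremental copy-on-write is to split the bits into fixed-size pages so that a clone copies only the page it modifies. The sketch below is an assumption about how such a structure could look, not the LUCENE-1526 patch: `PagedDeletes`, its method names, and the page size of 1024 documents are all hypothetical.

```java
import java.util.Arrays;

/** Hypothetical paged delete set: a clone copies only the page it modifies. */
final class PagedDeletes {
    private static final int PAGE_BITS = 1024;  // docs per page (arbitrary choice)
    private final long[][] pages;               // inner arrays may be shared between clones
    private final boolean[] owned;              // owned[i]: page i is private to this instance

    PagedDeletes(int maxDoc) {
        int pageCount = (maxDoc + PAGE_BITS - 1) / PAGE_BITS;
        pages = new long[pageCount][PAGE_BITS / 64];
        owned = new boolean[pageCount];
        Arrays.fill(owned, true);
    }

    private PagedDeletes(long[][] sharedPages) {
        pages = sharedPages.clone();            // shallow: copies page references only
        owned = new boolean[pages.length];      // all false: every page is shared
    }

    /** Cheap clone: shares every page with the parent. */
    PagedDeletes cloneShared() {
        Arrays.fill(owned, false);              // the parent must now copy before writing too
        return new PagedDeletes(pages);
    }

    /** Marks a document deleted; returns how many longs had to be copied. */
    int delete(int docId) {
        int page = docId / PAGE_BITS;
        int offset = docId % PAGE_BITS;
        int copied = 0;
        if (!owned[page]) {
            pages[page] = pages[page].clone();  // copy just this one page
            owned[page] = true;
            copied = pages[page].length;
        }
        pages[page][offset >>> 6] |= 1L << (offset & 63);
        return copied;
    }

    boolean isDeleted(int docId) {
        int page = docId / PAGE_BITS;
        int offset = docId % PAGE_BITS;
        return (pages[page][offset >>> 6] & (1L << (offset & 63))) != 0;
    }
}
```

With 4096 documents there are four pages of 16 longs each, so the first delete on a clone copies 16 longs instead of all 64. The saving grows with index size, which is the case the slide is concerned with.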

LUCENE-1231
• Column stride fields will make field cache loading faster because data will be loaded sequentially from disk
• Today there are potentially two hard drive seeks per field cache value (TermEnum.next, TermDocs.next)
• Lucene 3.0?
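The difference between the two loading strategies can be illustrated in memory. This sketch is not Lucene code and does not measure disk seeks: `FieldCacheLoading` is a hypothetical class, a sorted map of term to document IDs stands in for the inverted index, and an `IntBuffer` stands in for a column-stride file whose values are stored in document order. Both methods produce the same per-document array, but the first walks terms and then postings while the second does a single sequential read.

```java
import java.nio.IntBuffer;
import java.util.Map;
import java.util.SortedMap;

/** Hypothetical comparison of two ways to fill a per-document value array. */
final class FieldCacheLoading {
    private FieldCacheLoading() {}

    /** Un-inverts a term -> docIDs mapping into an array indexed by docID. */
    static int[] uninvert(SortedMap<Integer, int[]> postings, int maxDoc) {
        int[] values = new int[maxDoc];
        for (Map.Entry<Integer, int[]> term : postings.entrySet()) {  // analogous to TermEnum.next
            for (int docId : term.getValue()) {                       // analogous to TermDocs.next
                values[docId] = term.getKey();                        // scattered writes by docID
            }
        }
        return values;
    }

    /** Reads a column whose values are already stored in docID order. */
    static int[] readColumn(IntBuffer column, int maxDoc) {
        int[] values = new int[maxDoc];
        column.get(values, 0, maxDoc);  // one sequential bulk read
        return values;
    }
}
```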

Future of Lucene NRT
• LUCENE-1292 – Realtime parallel untokenized field index (for tags)
• Pulsing – Store smaller postings directly in the term dictionary (to avoid seeks) for faster field cache loading
• Replication
• More benchmarks

LinkedIn Open Source Projects
• Bobo – Facet library that counts using custom field caches http://code.google.com/p/bobo-browse/
• Zoie – Realtime search on top of Lucene http://code.google.com/p/zoie/
• Voldemort – Distributed key-value storage http://project-voldemort.com/

BoboBrowse: facet features
• MultiSelect
• Runtime-defined facets (query-based, etc)
• Fast (custom field-cache based)
• Custom facet types:
  – Hierarchical (/a/b/c)
  – Range
  – Multivalued

Zoie: realtime features
• No modifications to core Lucene
• Multiple read/write: RAMDir + FSDir
• IndexReader on (small) RAMDir opened per request: instantly realtime
• IndexReaderDecorator for custom Reader
• Transparent Indexing: implement StreamDataProvider then inject
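The RAMDir + FSDir arrangement can be pictured as a two-tier index: new documents land in a small in-memory tier that is searchable at once, and are later moved in bulk to the larger on-disk tier. The sketch below is only a rough illustration of that idea and is not Zoie's implementation: `TwoTierIndex` and its methods are hypothetical, plain lists stand in for the two directories, and substring matching stands in for a real query.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical two-tier index; lists stand in for a RAMDirectory and an FSDirectory. */
final class TwoTierIndex {
    private final List<String> ram = new ArrayList<>();   // small, updated per document
    private final List<String> disk = new ArrayList<>();  // large, updated in batches

    /** New documents are visible to the next search immediately. */
    void add(String doc) {
        ram.add(doc);
    }

    /** Moves the in-memory tier to the disk tier in one batch. */
    void flushToDisk() {
        disk.addAll(ram);
        ram.clear();
    }

    /** Searches both tiers and returns disk hits followed by in-memory hits. */
    List<String> search(String term) {
        List<String> hits = new ArrayList<>();
        for (String doc : disk) {
            if (doc.contains(term)) {
                hits.add(doc);
            }
        }
        for (String doc : ram) {
            if (doc.contains(term)) {
                hits.add(doc);
            }
        }
        return hits;
    }
}
```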

Next Steps
• Help work on the patches? https://issues.apache.org/jira/browse/LUCENE
• LinkedIn is hiring
• Contact: jason.rutherglen@gmail.com or jake.mannix@gmail.com