Numeric Range Queries with Lucene TrieRange

Uwe Schindler, Lucene Java Contrib Committer, uschindler@apache.org
PANGAEA® - Publishing Network for Geoscientific & Environmental Data
MARUM, Center for Marine Environmental Sciences, Bremen, Germany

Problems with current RangeQueries/RangeFilters
• The classical RangeQuery hits a TooManyClausesException on large ranges and is very slow.
• ConstantScoreRangeQuery is faster and cacheable, but still has to visit a large number of terms.
• Both need to enumerate a large number of terms from TermEnum and then retrieve TermDocs for each term.
• The number of terms to visit grows with the number of documents and unique values in the index (especially for float/double values).
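The growth described in the last bullet can be illustrated with a toy model. The sketch below is not Lucene code: it assumes the term dictionary can be modeled as a sorted set with one unique value per document, and the class and method names are hypothetical. It only counts how many distinct terms fall inside a range, which is the number of terms a classical range query would have to visit.

```java
import java.util.TreeSet;

// Hypothetical illustration: the term dictionary is modeled as a sorted set.
class ClassicRangeCost {

    /** Number of distinct indexed terms lying inside [lower, upper]. */
    static int termsVisited(TreeSet<Long> uniqueTerms, long lower, long upper) {
        // A classical range query needs one term lookup per unique term in the range.
        return uniqueTerms.subSet(lower, true, upper, true).size();
    }

    /** Builds a toy "index" with one unique value (i * 10) per document. */
    static TreeSet<Long> toyIndex(int docs) {
        TreeSet<Long> terms = new TreeSet<>();
        for (long i = 0; i < docs; i++) {
            terms.add(i * 10);
        }
        return terms;
    }
}
```

In this model a range spanning the whole toy index visits as many terms as there are documents, so a 100,000-document index already exceeds Lucene's default BooleanQuery limit of 1024 clauses by a wide margin.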

TrieRange: How it works
[Figure: range diagram]
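Since the diagram is not reproduced here, the following is a minimal sketch of the general idea, under stated assumptions: values are non-negative and fit into valSize bits (at most 32 in this sketch), and every value has been indexed at several precisions, each dropping precisionStep more low-order bits. The range is then covered by a few full-precision terms at its edges and coarser terms in the middle. The class, method, and field names are hypothetical and this is not the library's code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of splitting a numeric range across trie precisions.
class TrieRangeSplitSketch {

    /** All terms of precision `shift` (low bits dropped) between min and max, inclusive. */
    static final class SubRange {
        final long min, max;
        final int shift;

        SubRange(long min, long max, int shift) {
            this.min = min;
            this.max = max;
            this.shift = shift;
        }

        /** Number of distinct terms this sub-range can match at its precision. */
        long termCount() {
            return ((max - min) >>> shift) + 1;
        }
    }

    static List<SubRange> split(long min, long max, int valSize, int precisionStep) {
        if (valSize < 1 || valSize > 32 || precisionStep < 1 || precisionStep > valSize) {
            throw new IllegalArgumentException("unsupported valSize/precisionStep for this sketch");
        }
        if (min < 0 || max >= (1L << valSize)) {
            throw new IllegalArgumentException("bounds must be non-negative and fit into valSize bits");
        }
        List<SubRange> out = new ArrayList<>();
        if (min > max) {
            return out;
        }
        for (int shift = 0; ; shift += precisionStep) {
            long lowBits = (1L << shift) - 1L;                  // bits already dropped at this level
            long diff = 1L << (shift + precisionStep);          // size of one block at the next level
            long mask = ((1L << precisionStep) - 1L) << shift;  // bits that distinguish this level
            boolean hasLower = (min & mask) != 0L;   // lower edge is not aligned to the next level
            boolean hasUpper = (max & mask) != mask; // upper edge is not aligned to the next level
            long nextMin = (hasLower ? min + diff : min) & ~mask;
            long nextMax = (hasUpper ? max - diff : max) & ~mask;
            if (shift + precisionStep >= valSize || nextMin > nextMax) {
                // Coarsest usable precision reached: emit the remainder and stop.
                out.add(new SubRange(min, max | lowBits, shift));
                break;
            }
            if (hasLower) {
                out.add(new SubRange(min, min | mask | lowBits, shift));
            }
            if (hasUpper) {
                out.add(new SubRange(max & ~mask, max | lowBits, shift));
            }
            min = nextMin;
            max = nextMax;
        }
        return out;
    }
}
```

For the range 0x0123 to 0x0456 with 16-bit values and a precisionStep of 4, this sketch yields 5 sub-ranges that together match at most 40 terms, compared with up to 820 terms if every value in the range had to be visited at full precision.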

Supported Data Types
• Native data types: long, int (standard Java signed). "Tricks" like padding are not needed! These types are internally made unsigned; each trie precision is generated by stripping off least significant bits (using the precisionStep parameter). Each value is then converted to a sequence of 7-bit ASCII chars, the result is prefixed with the number of bits stripped, and it is indexed as a term. Only 7 bits/char are used because this gives the most efficient bit layout in the index (8 or more bits would split into two or more bytes when UTF-8 encoded).
• double, float: Converter to/from the IEEE-754 bit layout that sorts like a signed long/int.
• Date/Calendar: Convert to a long time stamp, e.g. with Date.getTime() (milliseconds since the UNIX epoch).
• Money/prices: Do not use float/double (rounding); use a long/int representation of cents.
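The encoding steps in the first two bullets can be sketched as follows. This is an illustration under assumptions, not the library's implementation: the class and method names are hypothetical, and the prefix character (0x20 plus the number of stripped bits) is an assumed choice for marking the precision. The sketch flips the sign bit to make a long sortable as unsigned, strips low bits per precision, and packs the rest into 7-bit chars; it also shows one way to map a double to a long that sorts in the same order.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the trie term encoding described above.
class TrieEncodingSketch {

    /** Maps a double to a long whose signed order matches the numeric order of the double. */
    static long doubleToSortableLong(double v) {
        long bits = Double.doubleToLongBits(v);
        // Negative doubles sort in reverse by raw bits, so invert all bits except the sign.
        return bits < 0 ? bits ^ 0x7fffffffffffffffL : bits;
    }

    /** Encodes one precision of a long: prefix char for the shift, then 7 bits per char. */
    static String encode(long value, int shift) {
        // Flip the sign bit so signed order becomes unsigned order, then strip low bits.
        long unsigned = (value ^ 0x8000000000000000L) >>> shift;
        int nChars = (63 - shift) / 7 + 1;
        char[] buf = new char[nChars + 1];
        buf[0] = (char) (0x20 + shift); // assumed prefix: number of bits stripped
        for (int i = nChars; i >= 1; i--) {
            buf[i] = (char) (unsigned & 0x7f); // 7 bits per char keeps each char in one UTF-8 byte
            unsigned >>>= 7;
        }
        return new String(buf);
    }

    /** All terms indexed for one value: one per precision, from full precision downwards. */
    static List<String> trieTerms(long value, int precisionStep) {
        if (precisionStep < 1 || precisionStep > 64) {
            throw new IllegalArgumentException("precisionStep must be in 1..64");
        }
        List<String> terms = new ArrayList<>();
        for (int shift = 0; shift < 64; shift += precisionStep) {
            terms.add(encode(value, shift));
        }
        return terms;
    }
}
```

With a precisionStep of 8 each long value produces 8 terms, and with a precisionStep of 4 it produces 16. Two values that differ only in the stripped bits share the same lower-precision term, which is what lets a range query match many values with a single term.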

Speed
• Upper limit on the number of terms, independent of index size. This value depends only on precisionStep.
• Term numbers: 8 bit approx. 400 terms, 4 bit approx. 100 terms, 2 bit approx. 40 terms.
• Query time: in most cases <100 ms with a 500,000-document index, 13 trie fields, precisionStep 8 bit.
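For comparison, a theoretical worst case can be estimated from precisionStep alone. The sketch below assumes that each precision level except the coarsest contributes at most 2^precisionStep - 1 terms on each side of the range, and the coarsest level at most 2^precisionStep - 1 terms in total; the class and method names are hypothetical. The approximate figures on the slide are far below this bound, which suggests they are typical counts rather than the worst case, although the slide does not say so.

```java
// Hypothetical sketch: worst-case number of terms a trie range can expand to.
class TrieTermBoundSketch {

    /** Assumes precisionStep divides valSize evenly (e.g. 64 bits with steps of 8, 4, or 2). */
    static long maxTerms(int valSize, int precisionStep) {
        if (precisionStep < 1 || valSize % precisionStep != 0) {
            throw new IllegalArgumentException("precisionStep must divide valSize");
        }
        long perSide = (1L << precisionStep) - 1; // max terms per range edge on one level
        int levels = valSize / precisionStep;     // number of indexed precisions
        // Two edges on every level but the coarsest, one block of terms on the coarsest.
        return (levels - 1) * perSide * 2 + perSide;
    }
}
```

For 64-bit values this gives 3825 terms at a precisionStep of 8, 465 at 4, and 189 at 2. The bound does not depend on the number of documents, which matches the first bullet.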

How to use (indexing)

How to use (searching)

Future Developments
• Current state: A helper field for lower precision values is needed (because of sorting). There are some ideas for fixing this (see recent discussions on java-dev).
• Planned: A nicer and more GC-friendly API with more flexibility on indexing: trieCodeLong() and trieCodeInt() return a TokenStream that can be indexed into one field with custom options (Solr implements this with a wrapper at the moment).
• Move to core, with a more user-friendly name (NumberRangeQuery, NumberUtils)?

Demonstration
• www.pangaea.de (main site)
• www.wdc-mare.org (displays query time)

Thank You!
