Document Indexing and Scoring in Lucene and Nutch




















IST 441, Spring 2009. Instructor: Dr. C. Lee Giles. Presenter: Saurabh Kataria.

Outline

- Architecture of Lucene and Nutch
- Indexing in Lucene
- Searching in Lucene
- Lucene’s scoring function

Lucene’s Open Architecture (Spring 2008)

[Figure: architecture diagram with four stages: crawling, parsing, indexing, and searching. Labels: WWW, IMAP server, file system, Larm and FS crawler; TXT, PDF, HTML, and DOC parsers; Lucene documents; Stop, Standard, and CN/DE analyzers; indexer; index; searcher.]

Nutch’s architecture

[Figure: Nutch architecture diagram, courtesy of Doug Cutting’s presentation at WWW 2004.]

Nutch’s architecture

- Searcher: Given a query, it must quickly find a small relevant subset of a corpus of documents, then present them. Finding a large relevant subset is normally done with an inverted index of the corpus; that set is then ranked to produce the most relevant documents, which must be summarized for display.
- Indexer: Creates the inverted index from which the searcher extracts results. It uses Lucene for storing indexes.
- Web DB: Stores the document contents for indexing and later summarization by the searcher, along with information such as the link structure of the document space and the time each document was last fetched.
- Fetcher: Requests web pages, parses them, and extracts links from them. Nutch’s robot has been written entirely from scratch.

Lucene’s index (conceptual)

[Figure: an index contains documents; each document contains fields; each field has a name and a value.]

Create a Lucene index (step 1)

Create a Lucene document and add fields:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public Document createDoc(String title, String body) {
        Document doc = new Document();
        // Body text: tokenized and indexed for search, but not stored.
        doc.add(new Field("text", body, Field.Store.NO, Field.Index.TOKENIZED));
        // Title: tokenized, indexed, and stored so it can be shown in results.
        doc.add(new Field("title", title, Field.Store.YES, Field.Index.TOKENIZED));
        return doc;
    }

Create a Lucene index (step 2)

Create an Analyzer. Options:

- WhitespaceAnalyzer: divides text at whitespace
- SimpleAnalyzer: divides text at non-letters; converts to lower case
- StopAnalyzer: SimpleAnalyzer behavior, plus removes stop words
- StandardAnalyzer: good for most European languages; removes stop words; converts to lower case

Create a Lucene index (step 2)

An example of analyzing a document
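The differences between the analyzer options can be seen on a single sentence. The sketch below uses only the Java standard library and imitates three of the analyzers; it is not Lucene's implementation. The class name, method names, and the stop-word list are assumptions made for illustration, and Lucene's own stop list differs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class AnalyzerSketch {
    // Hypothetical stop list for illustration only.
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "an", "the", "to", "of", "and"));

    // WhitespaceAnalyzer-like: split at whitespace, keep case and punctuation.
    static List<String> whitespace(String text) {
        List<String> out = new ArrayList<>();
        for (String tok : text.trim().split("\\s+")) {
            if (!tok.isEmpty()) out.add(tok);
        }
        return out;
    }

    // SimpleAnalyzer-like: split at non-letters, then convert to lower case.
    static List<String> simple(String text) {
        List<String> out = new ArrayList<>();
        for (String tok : text.split("[^\\p{L}]+")) {
            if (!tok.isEmpty()) out.add(tok.toLowerCase());
        }
        return out;
    }

    // StopAnalyzer-like: simple() followed by stop-word removal.
    static List<String> stop(String text) {
        List<String> out = new ArrayList<>();
        for (String tok : simple(text)) {
            if (!STOP_WORDS.contains(tok)) out.add(tok);
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "Hello, please say hello to him.";
        System.out.println(whitespace(text)); // punctuation stays attached
        System.out.println(simple(text));     // lower-cased letter runs
        System.out.println(stop(text));       // same, without "to"
    }
}
```

With this stop list, only "to" is dropped from the sample sentence; a different list would give a different result.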

Create a Lucene index (step 3)

Create an index writer and add the Lucene document to the index:

    import java.io.IOException;
    import org.apache.lucene.analysis.SimpleAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;

    public void writeDoc(Document doc, String idxPath) {
        try {
            // idxPath is the index directory, e.g. "/data/index".
            // The final "true" creates a new index, replacing any existing one.
            IndexWriter writer = new IndexWriter(
                FSDirectory.getDirectory(idxPath, true), new SimpleAnalyzer(), true);
            writer.addDocument(doc);
            writer.close();
        } catch (IOException exp) {
            System.out.println("I/O Error!");
        }
    }

Lucene Index – Behind the Scenes

Inverted index (inverted file):

- Doc 1: Penn State Football … football
- Doc 2: Football players … State

Posting table:

| Posting id | Word | Doc | Offset |
|---|---|---|---|
| 1 | football | Doc 1 | 3 |
| | | Doc 1 | 67 |
| | | Doc 2 | 1 |
| 2 | penn | Doc 1 | 1 |
| 3 | players | Doc 2 | 2 |
| 4 | state | Doc 1 | 2 |
| | | Doc 2 | 13 |
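A posting table of this shape can be built in a few lines. The sketch below is a minimal in-memory inverted index using the Java standard library, not Lucene's data structure; the class and method names are assumptions. Because the slide elides the middle of both documents, the sample input uses only their opening words, so the later offsets (67 and 13) do not appear.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class InvertedIndexSketch {
    // One posting: a document id and the 1-based word offset within it.
    static final class Posting {
        final int doc;
        final int offset;
        Posting(int doc, int offset) { this.doc = doc; this.offset = offset; }
        @Override public String toString() { return "Doc " + doc + "@" + offset; }
    }

    // TreeMap keeps the terms sorted, as in the posting table.
    final Map<String, List<Posting>> postings = new TreeMap<>();

    // Splits at non-letters, lower-cases, and records every occurrence.
    void add(int docId, String text) {
        int offset = 0;
        for (String tok : text.split("[^\\p{L}]+")) {
            if (tok.isEmpty()) continue;
            offset++;
            postings.computeIfAbsent(tok.toLowerCase(), k -> new ArrayList<>())
                    .add(new Posting(docId, offset));
        }
    }

    public static void main(String[] args) {
        InvertedIndexSketch idx = new InvertedIndexSketch();
        idx.add(1, "Penn State Football");
        idx.add(2, "Football players");
        for (Map.Entry<String, List<Posting>> e : idx.postings.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

Looking up a word is then a single map access, and the list it returns holds every (document, offset) pair for that word.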

- Posting table is a fast look-up mechanism
  - Key: word
  - Value: posting id, satellite data (#df, offset, …)
- Lucene implements the posting table with Java’s hash table
  - Objectified from java.util.Hashtable
  - Hash function depends on the JVM: hc2 = hc1 * 31 + nextChar
- Posting table usage
  - Indexing: insertion (new terms), update (existing terms)
  - Searching: lookup, and construct document vector
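The recurrence hc2 = hc1 * 31 + nextChar can be checked directly. The sketch below applies it character by character, starting from zero, and prints the result next to `String.hashCode()`, whose Javadoc documents the same polynomial. The class and method names are assumptions made for illustration.

```java
public class HashSketch {
    // Applies hc2 = hc1 * 31 + nextChar over every character, starting from 0.
    // Integer overflow wraps around, as it does in String.hashCode().
    static int hash(String term) {
        int hc = 0;
        for (int i = 0; i < term.length(); i++) {
            hc = hc * 31 + term.charAt(i);
        }
        return hc;
    }

    public static void main(String[] args) {
        for (String term : new String[] {"football", "penn", "players", "state"}) {
            System.out.println(term + " -> " + hash(term)
                + " (String.hashCode: " + term.hashCode() + ")");
        }
    }
}
```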

Lucene Index Files: Field infos file (.fnm)

Format: `FieldsCount, <FieldName, FieldBits>`

- FieldsCount: the number of fields in the index
- FieldName: the name of the field, as a string
- FieldBits: a byte and an int, where the lowest bit of the byte shows whether the field is indexed, and the int is the id of the term

Example: `1, <content, 0x01>`

Lucene Index Files: Term Dictionary file (.tis)

Format: `TermCount, TermInfos`, where each `TermInfo` is `<Term, DocFreq>` and each `Term` is `<PrefixLength, Suffix, FieldNum>`

- Terms are ordered first lexicographically by the term's field name, and within that lexicographically by the term's text.
- TermCount: the number of terms in the documents
- Term text prefixes are shared. The PrefixLength is the number of initial characters from the previous term which must be pre-pended to a term's suffix in order to form the term's text. Thus, if the previous term's text was "bone" and the term is "boy", the PrefixLength is two and the suffix is "y".
- FieldNum: the term's field, whose name is stored in the .fnm file

Example: `4, <<0, football, 1>, 2> <<0, penn, 1>, 1> <<1, layers, 1>, 1> <<0, state, 1>, 2>`

Document frequency can be obtained from this file.
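The shared-prefix rule is easy to get wrong by one character, so a small sketch may help. It encodes a sorted term list as (PrefixLength, Suffix) pairs and decodes them again. This illustrates the rule only, not Lucene's on-disk byte format; the class name and the `prefixLength:suffix` string representation are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PrefixCodingSketch {
    // Encodes each term as "prefixLength:suffix" relative to the previous term.
    static List<String> encode(List<String> sortedTerms) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (String term : sortedTerms) {
            int p = 0;
            int max = Math.min(prev.length(), term.length());
            while (p < max && prev.charAt(p) == term.charAt(p)) p++;
            out.add(p + ":" + term.substring(p));
            prev = term;
        }
        return out;
    }

    // Rebuilds each term from the previous term's prefix plus the suffix.
    static List<String> decode(List<String> encoded) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (String entry : encoded) {
            int colon = entry.indexOf(':');
            int p = Integer.parseInt(entry.substring(0, colon));
            String term = prev.substring(0, p) + entry.substring(colon + 1);
            out.add(term);
            prev = term;
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("football", "penn", "players", "state");
        List<String> coded = encode(terms);
        System.out.println(coded);         // "players" shares one character with "penn"
        System.out.println(decode(coded)); // round-trips to the original terms
    }
}
```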

Lucene Index Files: Term Info index (.tii)

Format: `IndexTermCount, IndexInterval, TermIndices`, where each entry of `TermIndices` is `<TermInfo, IndexDelta>`

- This contains every IndexInterval-th entry from the .tis file, along with its location in the .tis file. It is designed to be read entirely into memory and used to provide random access to the .tis file.
- IndexDelta determines the position of this term's TermInfo within the .tis file. In particular, it is the difference between the position of this term's entry in that file and the position of the previous term's entry.

Example: `4, <football, 1> <penn, 3><layers, 2> <state, 1>`
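The idea of keeping every IndexInterval-th term in memory and scanning a short run of the full dictionary can be sketched as follows. This is a simplified model, not Lucene's reader: list positions stand in for the file offsets that IndexDelta encodes, and the class and field names are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SparseTermIndexSketch {
    final List<String> dictionary;                      // stands in for the sorted .tis entries
    final List<String> indexTerms = new ArrayList<>();  // every interval-th term (.tii)
    final int interval;

    SparseTermIndexSketch(List<String> sortedTerms, int interval) {
        this.dictionary = sortedTerms;
        this.interval = interval;
        for (int i = 0; i < sortedTerms.size(); i += interval) {
            indexTerms.add(sortedTerms.get(i));
        }
    }

    // Returns the term's position in the dictionary, or -1 if it is absent.
    int lookup(String term) {
        // Binary search the in-memory index for the last entry <= term.
        int lo = 0, hi = indexTerms.size() - 1, block = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (indexTerms.get(mid).compareTo(term) <= 0) {
                block = mid;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        if (block < 0) return -1; // term sorts before the first dictionary entry
        // Scan at most `interval` dictionary entries starting at that block.
        int start = block * interval;
        int end = Math.min(start + interval, dictionary.size());
        for (int i = start; i < end; i++) {
            if (dictionary.get(i).equals(term)) return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        SparseTermIndexSketch idx = new SparseTermIndexSketch(
            Arrays.asList("football", "penn", "players", "state"), 2);
        System.out.println(idx.indexTerms);      // the in-memory subset
        System.out.println(idx.lookup("state")); // position in the dictionary
        System.out.println(idx.lookup("nutch")); // -1: not in the dictionary
    }
}
```

A larger interval shrinks the in-memory index but lengthens the scan, which is the trade-off the IndexInterval setting controls.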

Lucene Index Files: Frequency file (.frq)

Format: `<TermFreqs>`, where each `TermFreqs` is a sequence of `TermFreq` entries and each `TermFreq` is `DocDelta, Freq?`

- TermFreqs are ordered by term (the term is implicit, from the .tis file).
- TermFreq entries are ordered by increasing document number.
- DocDelta determines both the document number and the frequency. In particular, DocDelta/2 is the difference between this document number and the previous document number (or zero when this is the first document in a TermFreqs). When DocDelta is odd, the frequency is one. When DocDelta is even, the frequency is read as the next Int.

For example, the TermFreqs for a term which occurs once in document seven and three times in document eleven would be the following sequence of Ints: 15, 8, 3

Example: `<<2, 2, 3> <5> <3, 3>>`

Term frequency can be obtained from this file.
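The DocDelta rule can be written out as a short encoder. The sketch below follows the description above and produces plain integers rather than Lucene's variable-length byte encoding; the class and method names are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

public class FreqCodingSketch {
    // docs must be sorted ascending; freqs[i] is the term frequency in docs[i].
    static List<Integer> encode(int[] docs, int[] freqs) {
        List<Integer> out = new ArrayList<>();
        int prevDoc = 0; // the first delta is taken against zero
        for (int i = 0; i < docs.length; i++) {
            int delta = docs[i] - prevDoc;
            if (freqs[i] == 1) {
                out.add(delta * 2 + 1); // odd value: frequency is implicitly one
            } else {
                out.add(delta * 2);     // even value: the frequency follows
                out.add(freqs[i]);
            }
            prevDoc = docs[i];
        }
        return out;
    }

    public static void main(String[] args) {
        // Once in document 7, three times in document 11.
        System.out.println(encode(new int[] {7, 11}, new int[] {1, 3}));
    }
}
```

For the input shown, the encoder returns 15, 8, 3, the same sequence as the worked example in the text.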

Lucene Index Files: Position file (.prx)

Format: `<TermPositions>`, where each `TermPositions` is a sequence of `Positions` entries and each `Positions` is a sequence of `PositionDelta` values

- TermPositions are ordered by term (the term is implicit, from the .tis file).
- Positions entries are ordered by increasing document number (the document number is implicit from the .frq file).
- PositionDelta is the difference between the position of the current occurrence in the document and the previous occurrence (or zero, if this is the first occurrence in this document).

For example, the TermPositions for a term which occurs as the fourth term in one document, and as the fifth and ninth term in a subsequent document, would be the following sequence of Ints: 4, 5, 4

Example: `<<3, 64> <1>> <<1> <0>> <<0> <2>> <<2> <13>>`
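Position deltas restart in every document, which the following sketch makes explicit. It is an illustration of the rule using plain integers, not Lucene's file writer; the class and method names are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

public class PositionCodingSketch {
    // positionsPerDoc[i] holds the ascending positions of one term in document i.
    static List<Integer> encode(int[][] positionsPerDoc) {
        List<Integer> out = new ArrayList<>();
        for (int[] positions : positionsPerDoc) {
            int prev = 0; // deltas restart from zero in every document
            for (int pos : positions) {
                out.add(pos - prev);
                prev = pos;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Fourth term in one document; fifth and ninth in the next.
        System.out.println(encode(new int[][] {{4}, {5, 9}}));
    }
}
```

The input shown yields 4, 5, 4, and the positions 3 and 67 followed by 1 in a second document yield 3, 64, 1, as in the first entry of the example above.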

Query Process in Lucene

[Figure: a query is resolved through the field info (in memory), the term info index (in memory), the term dictionary (random file access), the frequency file (random file access), and the position file (random file access). Each step is labeled "constant time".]

Search Lucene’s index (step 1)

Construct a query (automatic):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.ParseException;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;

    // "field" is the default field searched when the query string names none.
    public Query formQuery(String field, String queryString) throws ParseException {
        QueryParser qp = new QueryParser(field, new StandardAnalyzer());
        return qp.parse(queryString);
    }
Search Lucene’s index (step 1)

Types of query:

- Boolean: [IST 441 Giles] [IST 441 OR Giles] [java AND NOT SUN]
- wildcard: [nu?ch] [nutc*]
- phrase: [“JAVA TOMCAT”]
- proximity: [“lucene nutch”~10]
- fuzzy: [roam~] matches roams and foam
- date range
- …

Search Lucene’s index (step 2)

Search the index:

    import java.io.IOException;
    import org.apache.lucene.search.*;
    import org.apache.lucene.store.*;

    public Hits searchIdx(String idxPath, Query query) throws IOException {
        // Open the existing index (create = false) and run the query.
        Directory fsDir = FSDirectory.getDirectory(idxPath, false);
        IndexSearcher is = new IndexSearcher(fsDir);
        return is.search(query);
    }

Search Lucene’s index (step 3)

Display the results:

    for (int i = 0; i < hits.length(); i++) {
        Document doc = hits.doc(i);
        // Show your results; assumes the documents have a stored field named "id".
        System.out.println("id" + doc.get("id"));
    }

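Default Scoring Function

Similarity:

$$\mathrm{score}(Q, D) = \mathrm{coord}(Q, D) \cdot \mathrm{queryNorm}(Q) \cdot \sum_{t \in Q} \Big( \mathrm{tf}(t \in D) \cdot \mathrm{idf}(t)^2 \cdot t.\mathrm{getBoost}() \cdot \mathrm{norm}(D) \Big)$$

Question: What type of IR model does Lucene use?

Term-based factors:

- tf(t in D): term frequency of term t in document D. Default implementation: $\mathrm{tf}(t \in D) = \mathrm{frequency}^{1/2}$
- idf(t): inverse document frequency of term t in the entire corpus. Default implementation: $\mathrm{idf}(t) = \ln\big(N / (n_j + 1)\big) + 1$, where $N$ is the number of documents and $n_j$ is the number of documents containing t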

Default Scoring Function

- $\mathrm{coord}(Q, D) = \dfrac{\text{overlap between } Q \text{ and } D}{\text{maximum overlap}}$, where maximum overlap is the maximum possible length of overlap between Q and D.
- $\mathrm{queryNorm}(Q) = 1 / (\text{sum of square weight})^{1/2}$, where

  $$\text{sum of square weight} = q.\mathrm{getBoost}()^2 \cdot \sum_{t \in Q} \big( \mathrm{idf}(t) \cdot t.\mathrm{getBoost}() \big)^2$$

  If $t.\mathrm{getBoost}() = 1$ and $q.\mathrm{getBoost}() = 1$, then $\text{sum of square weight} = \sum_{t \in Q} \mathrm{idf}(t)^2$, thus $\mathrm{queryNorm}(Q) = 1 / \big( \sum_{t \in Q} \mathrm{idf}(t)^2 \big)^{1/2}$.
- $\mathrm{norm}(D) = 1 / (\text{number of terms})^{1/2}$. This is the normalization by the total number of terms in a document; number of terms is the total number of terms that appear in document D.

Example:

- D1: hello, please say hello to him.
- D2: say goodbye
- Q: you say hello

coord(Q, D) = overlap between Q and D / maximum overlap:

$$\mathrm{coord}(Q, D_1) = 2/3, \quad \mathrm{coord}(Q, D_2) = 1/2$$

queryNorm(Q) = 1 / (sum of square weight)^{1/2}, with $t.\mathrm{getBoost}() = 1$ and $q.\mathrm{getBoost}() = 1$, so that sum of square weight $= \sum_{t \in Q} \mathrm{idf}(t)^2$:

$$\mathrm{queryNorm}(Q) = 1 / (0.5945^2 + 1^2)^{1/2} = 0.8596$$

tf(t in D) = frequency^{1/2}:

$$\mathrm{tf}(\text{you}, D_1) = 0, \quad \mathrm{tf}(\text{say}, D_1) = 1, \quad \mathrm{tf}(\text{hello}, D_1) = 2^{1/2} = 1.4142$$

$$\mathrm{tf}(\text{you}, D_2) = 0, \quad \mathrm{tf}(\text{say}, D_2) = 1, \quad \mathrm{tf}(\text{hello}, D_2) = 0$$

idf(t) = ln(N / (n_j + 1)) + 1:

$$\mathrm{idf}(\text{you}) = 0, \quad \mathrm{idf}(\text{say}) = \ln\big(2/(2+1)\big) + 1 = 0.5945, \quad \mathrm{idf}(\text{hello}) = \ln\big(2/(1+1)\big) + 1 = 1$$

norm(D) = 1 / (number of terms)^{1/2}:

$$\mathrm{norm}(D_1) = 1/6^{1/2} = 0.4082, \quad \mathrm{norm}(D_2) = 1/2^{1/2} = 0.7071$$

Scores:

$$\mathrm{Score}(Q, D_1) = \tfrac{2}{3} \cdot 0.8596 \cdot (1 \cdot 0.5945^2 + 1.4142 \cdot 1^2) \cdot 0.4082 = 0.4135$$

$$\mathrm{Score}(Q, D_2) = \tfrac{1}{2} \cdot 0.8596 \cdot (1 \cdot 0.5945^2) \cdot 0.7071 = 0.1074$$
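The worked example can be reproduced with a short program. The sketch below implements the formulas as this deck states them, not Lucene's Similarity class, and all names are assumptions. Two choices follow the slide's numbers rather than any library: maximum overlap is taken as the smaller of the query length and the document length, and a term that appears in no document gets idf = 0. All boosts are 1, and the query is assumed to contain each term once.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class ScoringSketch {
    // Lower-cases and splits at non-letters, dropping empty tokens.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String tok : text.toLowerCase().split("[^a-z]+")) {
            if (!tok.isEmpty()) out.add(tok);
        }
        return out;
    }

    // idf(t) = ln(N / (n + 1)) + 1, with n = number of documents containing t.
    static double idf(String term, List<List<String>> docs) {
        int df = 0;
        for (List<String> d : docs) {
            if (d.contains(term)) df++;
        }
        if (df == 0) return 0.0; // assumption taken from the slide: idf(you) = 0
        return Math.log((double) docs.size() / (df + 1)) + 1.0;
    }

    static double score(List<String> query, List<String> doc, List<List<String>> docs) {
        int overlap = 0;
        double sumOfSquares = 0.0; // sum over query terms of idf(t)^2
        double sum = 0.0;          // sum over query terms of tf * idf(t)^2
        for (String t : query) {
            double idf = idf(t, docs);
            int freq = Collections.frequency(doc, t);
            if (freq > 0) overlap++;
            sumOfSquares += idf * idf;
            sum += Math.sqrt(freq) * idf * idf;
        }
        // Assumption: maximum overlap = min(|Q|, |D|), which matches 2/3 and 1/2.
        double coord = (double) overlap / Math.min(query.size(), doc.size());
        double queryNorm = 1.0 / Math.sqrt(sumOfSquares);
        double norm = 1.0 / Math.sqrt(doc.size());
        return coord * queryNorm * sum * norm;
    }

    public static void main(String[] args) {
        List<String> d1 = tokens("hello, please say hello to him.");
        List<String> d2 = tokens("say goodbye");
        List<String> q = tokens("you say hello");
        List<List<String>> docs = Arrays.asList(d1, d2);
        System.out.printf("Score(Q, D1) = %.4f%n", score(q, d1, docs));
        System.out.printf("Score(Q, D2) = %.4f%n", score(q, d2, docs));
    }
}
```

The sketch does not guard against a query whose terms all have idf = 0, in which case queryNorm is undefined.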