New Indices for Text: Pat Trees and PAT Arrays

  • Slides: 25
New Indices for Text: Pat Trees and PAT Arrays
  • Authors: Gaston H. Gonnet, Ricardo A. Baeza-Yates, Tim Snider
  • Presenter: 吳彥欽

Outline
  • About the authors
  • Introduction
  • PAT Tree
  • Searching algorithms on the PAT Tree
  • PAT Array
  • Summary

About the Authors
  • Gaston H. Gonnet
      • Professor, ETH Zürich, Switzerland, Informatik, Institute for Scientific Computation
      • http://www.inf.ethz.ch/personal/gonnet/
  • Research interests
      • Symbolic and algebraic computation, heuristic algorithms
      • Computational biochemistry algorithms; development of the Darwin system
      • Text searching and sorting algorithms

Text Searching Methods
  • Lexicographical indices
  • Clustering techniques
  • Indices based on hashing

Traditional Model
  • Keywords
  • Problems
      • A basic structure is assumed.
      • Keywords must be extracted.
      • The number of keywords is variable.
      • Queries are restricted to keywords.

PAT tree
  • How to build indices?
      • Keywords?
      • Full text!
  • Why use a PAT tree?
      • No restriction on structure
      • No keywords are used

PAT-tree Structure
  • A PAT tree is a Patricia tree constructed over all the possible sistrings of a text.
  • Patricia tree
  • Sistring

Patricia tree
  • Binary digital tree
  • Internal node
  • Example (figure labels: skip number, link to data; keys: 0110010, 1001000, 0100010, 00010111, 001011)

Sistring
  • Treat the text as one long string.
  • Each position in the text corresponds to a semi-infinite string (sistring).

Sistring Example
  • Text: Today is Thursday, I want to ...
  • Sistrings shown: sistring 1, sistring 2, sistring 7, sistring 10
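The sistrings named on this slide can be reproduced with a short sketch. The helper name `sistring` and the `length` cutoff are illustrative assumptions; a real sistring is conceptually unbounded, so only a finite preview is returned here.

```python
def sistring(text: str, position: int, length: int = 20) -> str:
    """Return a finite preview of the sistring that starts at a 1-based position."""
    start = position - 1
    return text[start:start + length]


text = "Today is Thursday, I want to"

# Print the four sistrings listed on the slide.
for pos in (1, 2, 7, 10):
    print(f"sistring {pos}: {sistring(text, pos)}...")
```

Positions are 1-based to match the slide, so sistring 7 starts at "is" and sistring 10 starts at "Thursday".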

PAT Tree
  • A PAT tree is a Patricia tree constructed over all the possible sistrings of a text.
  • PAT tree = Patricia tree + all sistrings of the text
  • Example:
      Text:     a b b a a b a b a
      Position: 1 2 3 4 5 6 7 8 9 ...

Indexing Point
  • Word searching
  • Phrase searching
  • The choice of indexing points is application dependent.

Searching Algorithms on the PAT tree
  • Prefix searching
  • Range searching
  • Longest repetition searching
  • Proximity searching
  • Most significant or most frequent searching
  • Regular expression searching

Prefix Searching
  • Every node in the same subtree shares the same prefix.
  • A search ends at a subtree, at a single node, or with a miss.
  • Keep the size of each subtree in its internal node.
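A minimal sketch of prefix searching on a binary Patricia-style tree follows. It is not the authors' implementation: the tree is built by recursive partitioning rather than incremental insertion, characters are assumed to be 8-bit and never NUL (zero bits pad each sistring past the end of the text), the text is assumed non-empty, and all names (`Leaf`, `Internal`, `build`, `prefix_search`) are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Leaf:
    position: int  # 1-based start of the sistring


@dataclass
class Internal:
    bit: int    # index of the bit tested here; earlier bits are skipped
    size: int   # number of sistrings stored below this node
    zero: "Node"
    one: "Node"


Node = Union[Leaf, Internal]


def bit_at(text: str, position: int, index: int) -> int:
    """Bit `index` of the sistring at `position`; zero bits pad past the end."""
    char_index = position - 1 + index // 8
    code = ord(text[char_index]) if char_index < len(text) else 0
    return (code >> (7 - index % 8)) & 1


def build(text: str, positions: List[int], bit: int = 0) -> Node:
    """Split the set of sistrings on the first bit that separates them."""
    if len(positions) == 1:
        return Leaf(positions[0])
    while True:
        zeros = [p for p in positions if bit_at(text, p, bit) == 0]
        if 0 < len(zeros) < len(positions):
            break
        bit += 1  # every sistring agrees on this bit, so it is skipped
    ones = [p for p in positions if bit_at(text, p, bit) == 1]
    return Internal(bit, len(positions),
                    build(text, zeros, bit + 1), build(text, ones, bit + 1))


def leaves(node: Node) -> List[int]:
    """Collect the sistring positions stored below a node."""
    if isinstance(node, Leaf):
        return [node.position]
    return leaves(node.zero) + leaves(node.one)


def prefix_search(text: str, root: Node, prefix: str) -> List[int]:
    """Descend while the tested bit lies inside the prefix, then verify once."""
    node = root
    while isinstance(node, Internal) and node.bit < 8 * len(prefix):
        code = ord(prefix[node.bit // 8])
        node = node.one if (code >> (7 - node.bit % 8)) & 1 else node.zero
    found = leaves(node)
    # Skipped bits were never compared, so check one candidate against the text.
    return sorted(found) if text.startswith(prefix, found[0] - 1) else []


text = "abbaababa"
root = build(text, list(range(1, len(text) + 1)))
print(prefix_search(text, root, "ab"))  # [1, 5, 7]
```

The three outcomes on the slide map onto the code: the descent stops at an internal node (a subtree of answers), at a leaf (a single answer), or the final verification fails (a miss). Because each internal node stores `size`, the number of matches is available without visiting the leaves.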

Proximity Searching
  • Locate S1 and S2 in the PAT tree.
  • Find the tallest subtree that contains S1 and S2.
  • Sort the answers for S1 and S2 by position.
  • Check the proximity condition.
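The sort-then-check steps can be sketched as follows. This is one possible arrangement, not necessarily the one in the paper: `occurrences` is a naive stand-in for a PAT tree search, and sorting the smaller answer set and probing it with binary search is an assumption about how the proximity check might be organized. All names are hypothetical.

```python
from bisect import bisect_left
from typing import List, Tuple


def occurrences(text: str, pattern: str) -> List[int]:
    """Stand-in for a PAT tree search: 1-based start positions of `pattern`."""
    return [i + 1 for i in range(len(text)) if text.startswith(pattern, i)]


def proximity(text: str, s1: str, s2: str, max_dist: int) -> List[Tuple[int, int]]:
    """Pairs (position of s1, position of s2) at most `max_dist` characters apart."""
    a, b = occurrences(text, s1), occurrences(text, s2)
    swapped = len(a) > len(b)
    small, large = (b, a) if swapped else (a, b)
    small = sorted(small)  # sort the smaller answer set by position
    pairs = []
    for q in large:
        # Binary search for the first position inside the window around q.
        i = bisect_left(small, q - max_dist)
        while i < len(small) and small[i] <= q + max_dist:
            pairs.append((q, small[i]) if swapped else (small[i], q))
            i += 1
    return sorted(pairs)


text = "Today is Thursday, I want to"
print(proximity(text, "is", "Thursday", 5))  # [(7, 10)]
```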

Most Significant or Most Frequent Searching
  • Search for the biggest subtree.
  • Example: the most common word
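As a rough illustration, the "biggest subtree" idea can be imitated on sorted sistrings: with word starts as index points, all occurrences of one word form a contiguous run, and the longest run is the most common word. The sketch assumes lowercase words separated by single spaces (so that runs really are contiguous) and uses a plain sort in place of a real PAT tree; the function names are hypothetical.

```python
from itertools import groupby
from typing import List, Tuple


def word_index_points(text: str) -> List[int]:
    """0-based starts of space-separated words, sorted by their sistrings."""
    starts = [0] + [i + 1 for i, ch in enumerate(text) if ch == " "]
    return sorted(starts, key=lambda s: text[s:])


def first_word(text: str, start: int) -> str:
    """The word that begins at `start`."""
    return text[start:].split(" ", 1)[0]


def most_frequent_word(text: str) -> Tuple[str, int]:
    """Find the longest run of sorted sistrings that begin with the same word."""
    best = ("", 0)
    ordered = word_index_points(text)
    for word, run in groupby(ordered, key=lambda s: first_word(text, s)):
        count = len(list(run))
        if count > best[1]:
            best = (word, count)
    return best


print(most_frequent_word("to be or not to be that is the question to"))
```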

Regular Expression Searching
  • Convert the regular expression into a deterministic finite automaton (DFA).
  • Convert the character DFA into a binary DFA.
  • Apply the binary DFA to the PAT tree.

Improvement
  • Efficiency is important.
  • PAT tree drawbacks
      • External nodes use a large amount of physical space.
      • The number of internal nodes can be very large.

Solution
  • Map the tree onto the disk using supernodes.
      • Allocate as much of the tree as possible in a disk page.
  • Bucketing of external nodes
      • Every subtree with size less than b is stored in a bucket.

But...
  • Disk page fullness in the actual experiments was close to 80% (using a greedy algorithm).
  • Each tree page holds a path of 10 steps.

PAT Array
  • The size of the bucket!
  • Use a suffix array inside the bucket.
  • PAT array example
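A PAT array is simply the list of sistring positions in lexicographic order of the sistrings. The sketch below builds one for the short text "abbaababa" with a plain sort; this is a naive construction for illustration only, and `build_pat_array` is a hypothetical name.

```python
from typing import List


def build_pat_array(text: str) -> List[int]:
    """1-based sistring positions, ordered by the sistrings they start."""
    return sorted(range(1, len(text) + 1), key=lambda p: text[p - 1:])


text = "abbaababa"
pat = build_pat_array(text)
print(pat)  # [9, 4, 7, 5, 1, 8, 3, 6, 2]

# Show each entry next to the sistring it points at.
for p in pat:
    print(p, text[p - 1:])
```

Python compares a string as smaller than any longer string it prefixes, which matches padding the end of the text with a symbol smaller than every character.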

New Discovery
  • The PAT array supports every search except longest repetition.
  • Prefix searching and range searching can be done using only the PAT array.
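Both searches reduce to binary search over the sorted positions, as the sketch below suggests. The function names are hypothetical, and the choice to include sistrings that begin with the upper bound in a range search is an assumption rather than something stated on the slide.

```python
from typing import List


def build_pat_array(text: str) -> List[int]:
    """Naive construction: 1-based positions sorted by their sistrings."""
    return sorted(range(1, len(text) + 1), key=lambda p: text[p - 1:])


def bound(text: str, pat: List[int], key: str, strict: bool) -> int:
    """First index whose sistring, cut to len(key), is >= key (or > key if strict)."""
    lo, hi = 0, len(pat)
    while lo < hi:
        mid = (lo + hi) // 2
        start = pat[mid] - 1
        head = text[start:start + len(key)]
        if head < key or (strict and head == key):
            lo = mid + 1
        else:
            hi = mid
    return lo


def prefix_search(text: str, pat: List[int], prefix: str) -> List[int]:
    """Positions of all sistrings that begin with `prefix`, in text order."""
    return sorted(pat[bound(text, pat, prefix, False):bound(text, pat, prefix, True)])


def range_search(text: str, pat: List[int], low: str, high: str) -> List[int]:
    """Positions of sistrings from `low` up to those beginning with `high`."""
    return pat[bound(text, pat, low, False):bound(text, pat, high, True)]


text = "abbaababa"
pat = build_pat_array(text)
print(prefix_search(text, pat, "ab"))       # [1, 5, 7]
print(range_search(text, pat, "aa", "ab"))  # [4, 7, 5, 1]
```

Each search costs two binary searches, and the answer is a contiguous slice of the array, so its size is known before any result is read.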

PAT Array Operation
  • Build the PAT array in memory.
      • Use paging to avoid memory thrashing.
  • Merge two PAT arrays.
      • O(n2 · log(n1)) + O(n2)
      • Split first, then merge.
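One way to picture the merge is a binary-search step followed by a single sequential pass, sketched below. This is an illustration under stated assumptions, not the paper's procedure: the old text is assumed to end with a separator character that occurs nowhere else in it (here "|"), so appending new text cannot change the order of the old array, and `merge_pat_arrays` is a hypothetical name.

```python
from typing import List


def merge_pat_arrays(text: str, pat1: List[int], pat2: List[int]) -> List[int]:
    """Merge two sorted position lists that index the same combined text."""
    # Step 1: binary search each new sistring's slot in the old array.
    counts = [0] * (len(pat1) + 1)
    for p in pat2:
        lo, hi = 0, len(pat1)
        while lo < hi:
            mid = (lo + hi) // 2
            if text[pat1[mid] - 1:] < text[p - 1:]:
                lo = mid + 1
            else:
                hi = mid
        counts[lo] += 1
    # Step 2: one sequential pass; pat2 is sorted, so its entries are consumed in order.
    merged, new_entries = [], iter(pat2)
    for i in range(len(pat1) + 1):
        for _ in range(counts[i]):
            merged.append(next(new_entries))
        if i < len(pat1):
            merged.append(pat1[i])
    return merged


old_text, new_text = "abba|", "ababa"
text = old_text + new_text
n1 = len(old_text)

# Old array built before the new text existed; new array covers positions > n1.
pat1 = sorted(range(1, n1 + 1), key=lambda p: old_text[p - 1:])
pat2 = sorted(range(n1 + 1, len(text) + 1), key=lambda p: text[p - 1:])

merged = merge_pat_arrays(text, pat1, pat2)
print(merged)
```

The first step performs one binary search over the old array per new sistring, which corresponds to the n2 · log(n1) term on the slide.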

Delayed Reading Paradigm
  • Sistrings: random disk access
      • Reading a sistring
      • Store the request in the pool and wait.
      • Use the request to generate more requests.

Summary
  • Signature file
      • Storage is small but searching time is linear.
      • Filtering is needed.
  • Inverted file
      • Performance is good but storage is huge.
  • PAT tree
      • ...