New Indices for Text: PAT Trees and PAT Arrays
- Slides: 25
New Indices for Text: PAT Trees and PAT Arrays. Gaston H. Gonnet, Ricardo A. Baeza-Yates, Tim Snider. Presenter: 吳彥欽
Outline: Author introduction; Introduction; PAT Tree; Searching algorithms on the PAT Tree; PAT Array; Summary
Author Introduction: Gaston H. Gonnet, Professor, ETH Zürich, Switzerland, Informatik, Institute for Scientific Computation. http://www.inf.ethz.ch/personal/gonnet/ Research: symbolic and algebraic computation, heuristic algorithms; text searching and sorting algorithms; computational biochemistry algorithms; development of the Darwin system.
Text Searching Methods: lexicographical indices; clustering techniques; indices based on hashing.
Traditional Model: keywords. Problems: a basic structure is assumed; keywords must be extracted; the number of keywords is variable; queries are restricted to keywords.
PAT tree: How should the indices be built? From keywords? From the full text! Why use a PAT tree: no restriction on structure; no keywords are used.
PAT-tree Structure: A PAT tree is a Patricia tree constructed over all the possible sistrings of a text. Two ingredients: Patricia tree; sistring.
Patricia tree: binary digital tree. Internal node: skip number; link to data. Example keys: 0110010, 1001000, 0100010, 00010111, 001011
Sistring: treat the text as one long string; each position in the text corresponds to a semi-infinite string (sistring).
Sistring Example: Text: "Today is Thursday, I want to ..."; sistrings shown: sistring 1, sistring 2, sistring 7, sistring 10.
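A small sketch of how sistrings can be read off a text. It assumes 1-based positions, as in the sistring numbering on the slide, and uses only the visible part of the example text. The helper name `sistring` and the `width` cut-off are illustrative and not taken from the paper.

```python
def sistring(text, pos, width=20):
    """Return the first `width` characters of the sistring that starts
    at 1-based position `pos`. A sistring conceptually runs to the end
    of the text (and beyond); it is truncated here only for display."""
    start = pos - 1
    return text[start:start + width]


# Visible part of the slide's example text (the rest is elided there).
text = "Today is Thursday, I want to"

for pos in (1, 2, 7, 10):
    print(f"sistring {pos}: {sistring(text, pos)!r}")
```

Every sistring is identified by its starting position alone, so an index over sistrings only needs to store integers, not copies of the text.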
PAT Tree: A PAT tree is a Patricia tree constructed over all the possible sistrings of a text. PAT tree = Patricia tree + all sistrings of the text. Example: text abbaababa..., positions 123456789...
Indexing Point: word searching; phrase searching; the indexing point is application dependent.
Searching Algorithms on the PAT tree: prefix searching; range searching; longest repetition searching; proximity searching; most significant or most frequent searching; regular expression searching.
Prefix Searching: Every node in the same subtree has the same prefix. The search ends at a subtree, at a single node, or with a miss. Keep the size of each subtree in its internal node.
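The prefix search described above can be sketched on a tiny PAT tree. This is an illustration under stated assumptions, not the authors' implementation: the example text abbaababa is encoded as bits with a = 0 and b = 1 (an assumed encoding), sistrings are padded with 0 past the end of the text, positions are 0-based, and internal nodes store an absolute bit index rather than a relative skip. All function names are hypothetical.

```python
def bit_at(bits, pos, i):
    """Bit i of the sistring starting at 0-based `pos`, padded with '0'."""
    j = pos + i
    return bits[j] if j < len(bits) else "0"


def build(bits, positions, depth=0):
    """External node: an int position.
    Internal node: (bit_index, zero_child, one_child, subtree_size)."""
    if len(positions) == 1:
        return positions[0]
    for d in range(depth, len(bits)):
        zeros = [p for p in positions if bit_at(bits, p, d) == "0"]
        ones = [p for p in positions if bit_at(bits, p, d) == "1"]
        if zeros and ones:  # first bit on which the sistrings disagree
            return (d, build(bits, zeros, d + 1),
                    build(bits, ones, d + 1), len(positions))
    raise ValueError("two sistrings are identical after padding")


def leaves(node):
    """All text positions stored below `node`."""
    if isinstance(node, int):
        return [node]
    return leaves(node[1]) + leaves(node[2])


def prefix_search(bits, root, prefix):
    """Positions whose sistring starts with `prefix` (a string of 0/1)."""
    node = root
    while not isinstance(node, int) and node[0] < len(prefix):
        node = node[2] if prefix[node[0]] == "1" else node[1]
    cand = leaves(node)
    # Skipped bits were never inspected, so verify one candidate;
    # all candidates agree on the first len(prefix) bits.
    p = cand[0]
    if any(bit_at(bits, p, i) != prefix[i] for i in range(len(prefix))):
        return []
    # Drop matches that only exist because of the padding.
    return sorted(q for q in cand if q + len(prefix) <= len(bits))


text = "abbaababa"
bits = "".join("0" if c == "a" else "1" for c in text)
root = build(bits, list(range(len(bits))))
print(prefix_search(bits, root, "01"))  # occurrences of "ab"
```

The fourth field of each internal node is the subtree size, so the number of answers is available at the node where the descent stops without visiting the leaves (as long as no match depends on the padding).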
Proximity Searching: Search for S1 and S2 in the PAT tree. Find the tallest subtree that contains S1 and S2. Sort the answers for S1 and S2 by position. Check the proximity condition.
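A rough sketch of the sort-then-check step of proximity searching. It assumes the two answer sets are already available; here `occurrences` is a plain-string stand-in for a PAT tree prefix search, and the function names and the `max_dist` parameter are hypothetical. The smaller answer set is sorted by position and each position of the other set is looked up in it with binary search.

```python
import bisect


def occurrences(text, s):
    """Stand-in for a PAT tree prefix search: 0-based start positions of s."""
    out = []
    i = text.find(s)
    while i != -1:
        out.append(i)
        i = text.find(s, i + 1)
    return out


def proximity(text, s1, s2, max_dist):
    """Pairs (p1, p2) with s1 at p1, s2 at p2 and |p1 - p2| <= max_dist."""
    a, b = occurrences(text, s1), occurrences(text, s2)
    swapped = len(a) > len(b)
    small, big = (b, a) if swapped else (a, b)
    small = sorted(small)  # sort only the smaller answer set
    hits = []
    for p in big:
        lo = bisect.bisect_left(small, p - max_dist)
        hi = bisect.bisect_right(small, p + max_dist)
        for q in small[lo:hi]:
            hits.append((p, q) if swapped else (q, p))
    return sorted(hits)


print(proximity("to be or not to be", "to", "be", 3))
```

Sorting only the smaller set keeps the cost close to the size of the larger set times the logarithm of the smaller one.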
Most Significant or Most Frequent Searching: search for the biggest subtree; for example, the most common word.
Regular Expression Searching: convert the regular expression into a deterministic finite automaton (DFA); convert the character DFA into a binary DFA; run the binary DFA over the PAT tree.
Improvement: Efficiency is important. PAT tree drawbacks: external nodes use a large amount of physical space; the number of internal nodes can be very large.
Solution: Map the tree onto the disk using supernodes: allocate as much of the tree as possible in each disk page. Bucketing of external nodes: every subtree of size less than b is stored in a bucket.
But: Disk page fullness in the actual experiments was close to 80% (using a greedy algorithm). Each tree page holds a path of 10 steps.
PAT Array: The size of the bucket! Use a suffix array inside the bucket. Figure: PAT array example.
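A minimal sketch of a PAT array for the text abbaababa used earlier in the deck. A PAT array is the list of sistring start positions in lexicographic order of the sistrings. The function name is hypothetical, positions are 0-based here, and the construction is a naive sort that is only suitable for illustration, not the construction used in the paper.

```python
def build_pat_array(text):
    """Positions of all sistrings, ordered lexicographically by sistring.
    Naive: each comparison may scan a whole suffix."""
    return sorted(range(len(text)), key=lambda p: text[p:])


text = "abbaababa"
pat = build_pat_array(text)
for p in pat:
    print(p, text[p:])
```

The array stores one integer per indexing point and no tree pointers, which is the space saving the slide refers to.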
New Discovery: With the PAT array, only longest repetition searching is lost. Prefix searching and range searching can be done with the PAT array alone.
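Prefix and range searching on a PAT array reduce to binary search, which can be sketched as follows. This is an illustration with hypothetical function names and 0-based positions. The range search here treats both bounds as prefixes, which is one possible reading of "range"; the array is built naively for the example.

```python
def lower_bound(text, pat, key):
    """First index whose sistring, cut to len(key), is >= key."""
    lo, hi = 0, len(pat)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[pat[mid]:pat[mid] + len(key)] < key:
            lo = mid + 1
        else:
            hi = mid
    return lo


def upper_bound(text, pat, key):
    """First index whose sistring, cut to len(key), is > key."""
    lo, hi = 0, len(pat)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[pat[mid]:pat[mid] + len(key)] <= key:
            lo = mid + 1
        else:
            hi = mid
    return lo


def prefix_search(text, pat, prefix):
    """Positions of all sistrings that start with `prefix`."""
    return sorted(pat[lower_bound(text, pat, prefix):
                      upper_bound(text, pat, prefix)])


def range_search(text, pat, low, high):
    """Positions of sistrings whose leading characters lie between
    `low` and `high` inclusive, both bounds compared as prefixes."""
    return sorted(pat[lower_bound(text, pat, low):
                      upper_bound(text, pat, high)])


text = "abbaababa"
pat = sorted(range(len(text)), key=lambda p: text[p:])
print(prefix_search(text, pat, "ab"))
print(range_search(text, pat, "ab", "ba"))
```

Both searches take two binary searches over the array, and the answer is a contiguous slice of it.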
PAT Array Operation: Build the PAT array in memory: use paging and avoid memory thrashing. Merge two PAT arrays: O(n2*log(n1)) + O(n2); split first, then merge.
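One way to picture the merge step is the sketch below. It is not claimed to be the paper's procedure. It assumes both arrays index disjoint position sets of the same text, locates each entry of the small array in the large one by binary search (about n2*log(n1) sistring comparisons), and then interleaves the two arrays in a single sequential pass. Function and variable names are hypothetical.

```python
def merge_pat_arrays(text, big, small):
    """Merge two PAT arrays over the same text into one PAT array."""
    # buckets[i] collects the small entries that sort just before big[i];
    # the last bucket holds entries that sort after every big entry.
    buckets = [[] for _ in range(len(big) + 1)]
    for p in small:  # small is already sorted, so buckets stay sorted
        lo, hi = 0, len(big)
        while lo < hi:
            mid = (lo + hi) // 2
            if text[big[mid]:] < text[p:]:
                lo = mid + 1
            else:
                hi = mid
        buckets[lo].append(p)
    merged = []
    for i, b in enumerate(big):  # sequential pass over the large array
        merged.extend(buckets[i])
        merged.append(b)
    merged.extend(buckets[len(big)])
    return merged


text = "abbaababa"
big = sorted(range(0, 5), key=lambda p: text[p:])
small = sorted(range(5, 9), key=lambda p: text[p:])
print(merge_pat_arrays(text, big, small))
```

Because the large array is only read sequentially in the second pass, it can stay on disk while the small array and the buckets are kept in memory.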
Delayed Reading Paradigm: Reading sistrings causes random disk access. Instead of reading a sistring immediately: store the request in a pool and wait; use answered requests to generate more requests.
Summary: Signature file: storage is small but searching time is linear; filtering is needed. Inverted file: performance is good but storage is huge. PAT tree: …………………