STRING PROCESSING
CS 4323 / 0910-2
YFA
Available online at http://www.ittelkom.ac.id/sisfo/yanuar

11 YFA CS 4323 S1/IT/IR/E4/0410
Institut Teknologi Telkom
http://www.ittelkom.ac.id/staf/yanuar

Query Languages: the Common Query Language

The Common Query Language: a formal language for queries to information retrieval systems such as web indexes, bibliographic catalogs and museum collection information.

Objective: human readable and human writable; intuitive while maintaining the expressiveness of more complex languages.

Traditionally, query languages have fallen into two camps:

(a) Powerful and expressive languages which are not easily readable nor writable by non-experts (e.g. SQL and XQuery).

(b) Simple and intuitive languages not powerful enough to express complex concepts (e.g. CCL or Google's query language).

The Common Query Language: Examples

Simple queries
    dinosaur
    comp.sources.misc
    "complete dinosaur"
    "the complete dinosaur"
    "ext->u.generic"
    "and"

Booleans
    dinosaur or bird
    dinosaur and bird or dinobird
    (bird or dinosaur) and (feathers or scales)
    "feathered dinosaur" and (yixian or jehol)
    (((a and b) or (c not d) not (e or f and g)) and h not i) or j

The Common Query Language: Examples

Indexes [fielded searching]
    title = dinosaur
    title = ((dinosaur and bird) or dinobird)
    dc.title = saurischia
    bath.title="the complete dinosaur"
    srw.serverChoice=foo
    srw.resultSet=bar

Index-set mapping [definition of fields]
    >dc="http://www.loc.gov/srw/index-sets/dc" dc.title=dinosaur and dc.author=farlow

The Common Query Language: Examples

Proximity
The prox operator: prox/relation/distance/unit/ordering

Examples:
    complete prox dinosaur                [adjacent]
    (caudal or dorsal) prox vertebra
    ribs prox//5 chevrons                 [near 5]
    ribs prox//0/sentence chevrons        [same sentence]
    ribs prox/>/0/paragraph chevrons      [not adjacent]
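A proximity test over words can be sketched in a few lines. This is only an illustration, not how any particular CQL server evaluates prox: the function name prox_match, the whitespace tokenization, the unordered comparison, and the defaults (relation "<=", distance 1) are all assumptions made for the example, and only the word unit is handled.

```python
import operator

# Relations accepted by this sketch; the names fill the "relation" slot of prox.
RELATIONS = {
    "<": operator.lt, "<=": operator.le, "=": operator.eq,
    ">=": operator.ge, ">": operator.gt, "<>": operator.ne,
}

def prox_match(text, left, right, relation="<=", distance=1):
    """Return True if some occurrence of `left` and some occurrence of `right`
    are separated by a word distance that satisfies `relation distance`.

    Assumptions: unit is the word, ordering is ignored, and words are
    obtained by splitting on whitespace after lowercasing.
    """
    words = text.lower().split()
    left_pos = [i for i, w in enumerate(words) if w == left]
    right_pos = [i for i, w in enumerate(words) if w == right]
    compare = RELATIONS[relation]
    return any(compare(abs(i - j), distance)
               for i in left_pos for j in right_pos)
```

Under these assumptions, prox_match("the complete dinosaur", "complete", "dinosaur") succeeds because the two words are adjacent, while a pair of words three positions apart only matches once distance is raised to 3 or more.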

The Common Query Language: Examples

Relations
    year > 1998
    title all "complete dinosaur"
    title any "dinosaur bird reptile"
    title exact "the complete dinosaur"
    publicationYear < 1980
    numberOfWheels <= 3
    numberOfPlates = 18
    lengthOfFemur > 2.4
    bioMass >= 100
    numberOfToes <> 3

The Common Query Language: Examples

Relation Modifiers
    title all/stem "complete dinosaur"
    title any/relevant "dinosaur bird reptile"
    title exact/fuzzy "the complete dinosaur"
    author =/fuzzy tailor

The implementations of relevant and fuzzy are not defined by the query language.

The Common Query Language: Examples

Pattern Matching
    dinosaur*          [* matches zero or more characters]
    *sauria
    man?raptor*        [? matches exactly one character]
    "the comp*saur"
    char\*             [literal "*"]

Word Anchoring
    title="^the complete dinosaur"     [beginning of field]
    author="bakker^"                   [end of field]
    author all "^kernighan ritchie"
    author any "^kernighan ^ritchie ^thompson"
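One way to evaluate the masking characters is to translate a term into a regular expression. The sketch below assumes that * stands for zero or more characters, ? for exactly one, and that a backslash makes the following character literal; the function names and the case-insensitive matching are choices made for the example, and word anchoring with ^ is not handled.

```python
import re

def cql_pattern_to_regex(term):
    """Translate a masked search term into a compiled regular expression."""
    parts = []
    i = 0
    while i < len(term):
        ch = term[i]
        if ch == "\\" and i + 1 < len(term):
            # Backslash escape: take the next character literally.
            parts.append(re.escape(term[i + 1]))
            i += 2
            continue
        if ch == "*":
            parts.append(".*")        # zero or more characters
        elif ch == "?":
            parts.append(".")         # exactly one character
        else:
            parts.append(re.escape(ch))
        i += 1
    return re.compile("".join(parts), re.IGNORECASE)

def matches(term, word):
    """True if the whole word matches the masked term."""
    return cql_pattern_to_regex(term).fullmatch(word) is not None
```

For instance, matches("man?raptor*", "maniraptora") holds, while matches("man?raptor*", "manraptor") does not, because ? must consume one character.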

The Common Query Language: Examples

A complete example

    dc.author=(kern* or ritchie) and
    (bath.title exact "the c programming language" or
     dc.title=elements prox///4 dc.title=programming) and
    subject any/relevant "style design analysis"

Find records whose author (in the Dublin Core sense) includes either a word beginning with kern or the word ritchie, and which have either the exact title (in the sense of the Bath profile) the c programming language or a title containing the words elements and programming not more than four words apart, and whose subject is relevant to one or more of the words style, design or analysis.

String Searching: Naive Algorithm

Objective: Given a pattern, find any substring of a given text that matches the pattern.

    p    pattern to be matched
    m    length of pattern p (characters)
    t    the text to be searched
    n    length of t (characters)

The naive algorithm examines the characters of t in sequence.

    for j from 1 to n-m+1
        if character j of t matches the first character of p
            (compare following characters of t and p until a complete match or a difference is found)
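The loop above can be written out as a short runnable sketch. The function name naive_search and the choice to return every match position (1-based, to agree with the index j on the slide) are assumptions of this example.

```python
def naive_search(p, t):
    """Return the 1-based positions j in t at which pattern p occurs."""
    m, n = len(p), len(t)
    positions = []
    for j in range(1, n - m + 2):              # j = 1 .. n-m+1
        k = 0
        # Compare following characters until a full match or a difference.
        while k < m and t[j - 1 + k] == p[k]:
            k += 1
        if k == m:
            positions.append(j)
    return positions
```

Each of the n-m+1 starting positions may cost up to m comparisons, so the worst case is on the order of m times n character comparisons.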

String Searching: Knuth-Morris-Pratt Algorithm

Concept: The naive algorithm is modified so that, whenever a partial match is found, it may be possible to advance the character index, j, by more than 1.

Example:
    p = "university"
    t = "the uniform commercial code..."

At j=5 the first three characters of p ("uni") match and the fourth does not. After the partial match, the search continues at j=8.

To indicate how far to advance the character pointer, p is preprocessed to create a table, which lists how far to advance for a given length of partial match. In the example, j is advanced by the length of the partial match, 3.
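The preprocessing step can be sketched with the usual failure (prefix) table formulation of Knuth-Morris-Pratt. The slide speaks of a table of advances; here the advance is derived from the failure table as the partial match length minus the table entry, which is an equivalent way to express it. All function names are illustrative.

```python
def failure_table(p):
    """fail[i] = length of the longest proper prefix of p[:i+1]
    that is also a suffix of p[:i+1]."""
    fail = [0] * len(p)
    k = 0
    for i in range(1, len(p)):
        while k > 0 and p[i] != p[k]:
            k = fail[k - 1]
        if p[i] == p[k]:
            k += 1
        fail[i] = k
    return fail

def shift_after_partial_match(p, matched):
    """How far j advances after the first `matched` characters of p
    matched and the next one did not (matched >= 1)."""
    return matched - failure_table(p)[matched - 1]

def kmp_search(p, t):
    """Return the 1-based positions in t at which p occurs."""
    if not p:
        return []
    fail = failure_table(p)
    positions = []
    k = 0                                   # characters of p matched so far
    for i, ch in enumerate(t):
        while k > 0 and ch != p[k]:
            k = fail[k - 1]                 # fall back without re-reading t
        if ch == p[k]:
            k += 1
        if k == len(p):
            positions.append(i - k + 2)     # 0-based end -> 1-based start
            k = fail[k - 1]
    return positions
```

For p = "university", shift_after_partial_match(p, 3) gives 3, in agreement with the example: no proper prefix of "uni" is also a suffix of it, so the whole partial match can be skipped.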

Signature Files: Sequential Search without Inverted File

Inexact filter: A quick test which discards many of the nonqualifying items.

Advantages
  • Much faster than full text scanning -- 1 or 2 orders of magnitude
  • Modest space overhead -- 10% to 15% of file
  • Insertion is straightforward

Disadvantages
  • Sequential searching is no good for very large files
  • Some hits are false hits

Signature Files

Signature size. Number of bits in a signature, F.

Word signature. A bit pattern of size F with m bits set to 1 and the others 0. The word signature is calculated by a hash function.

Block. A sequence of text that contains D distinct words.

Block signature. The logical or of all the word signatures in a block of text.
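These definitions translate directly into code. The sketch below is one possible construction, not the hash function used in any particular system: it picks the m bit positions by repeatedly hashing the word with SHA-256 and a counter, which is purely an assumption of the example, so the resulting bit patterns will not reproduce signatures shown elsewhere.

```python
import hashlib

def word_signature(word, F=12, m=4):
    """Return an F-bit integer with exactly m bits set, chosen by hashing."""
    if not 0 < m <= F:
        raise ValueError("need 0 < m <= F")
    signature = 0
    counter = 0
    # Keep hashing until m distinct bit positions have been selected.
    while bin(signature).count("1") < m:
        digest = hashlib.sha256(f"{word}:{counter}".encode()).digest()
        signature |= 1 << (int.from_bytes(digest[:4], "big") % F)
        counter += 1
    return signature

def block_signature(words, F=12, m=4):
    """Logical OR of the word signatures of all words in the block."""
    signature = 0
    for word in words:
        signature |= word_signature(word, F, m)
    return signature
```

Because the block signature is an OR, adding a word to a block only ever sets more bits, which is why insertion is cheap.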

Signature Files: Example

    Word               Signature
    free               001 000 110 010
    text               000 010 101 001
    block signature    001 010 111 011

    F = 12 bits in a signature
    m = 4 bits per word
    D = 2 words per block

Signature Files

A query term is processed by matching its signature against the block signature.

(a) If the term is in the block, its word signature will always match the block signature.

(b) A word signature may match the block signature even though the word is not in the block. This is a false hit.

The design challenge is to minimize the false drop probability, Fd. Frakes, Section 4.2, page 47, discusses how to minimize Fd.

The rest of this chapter discusses enhancements to the basic algorithm.
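The matching test is a bitwise subset check: every bit set in the word signature must also be set in the block signature. The sketch below uses the 12-bit example with the signature of free taken as 001 000 110 010 and that of text as 000 010 101 001 (the values consistent with the block signature 001 010 111 011 and m = 4). The third signature, OTHER, is a made-up value chosen only to demonstrate a false hit.

```python
FREE = 0b001000110010       # word signature of "free"
TEXT = 0b000010101001       # word signature of "text"
BLOCK = FREE | TEXT         # block signature: 001 010 111 011

# Hypothetical signature of a word that is NOT in the block.
OTHER = 0b001010001001

def may_contain(block_sig, word_sig):
    """True if every 1 bit of the word signature is set in the block signature.

    A True result is only a candidate: the block text must still be checked,
    because unrelated words can produce a false hit.
    """
    return (block_sig & word_sig) == word_sig

def contains(block_words, block_sig, word, word_sig):
    """Filter with the signature first, then confirm against the words."""
    return may_contain(block_sig, word_sig) and word in block_words
```

Here may_contain(BLOCK, OTHER) is true although the word is absent, so only the second, exact check in contains rejects it.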

String Matching

Find File: Find all files whose name includes the string q.

Simple algorithm: Build an inverted index of all suffixes of the file names, that is, all substrings of the form *f.

Example: if the file name is foo.txt, the search terms are:

    foo.txt
    oo.txt
    o.txt
    .txt
    txt
    xt
    t

Lexicographic processing allows searching by any q, because q occurs in a file name exactly when q is a prefix of one of these terms.
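A small sketch of this idea: store every suffix of every file name in a sorted list, then answer a query by binary search for the first suffix that is not smaller than q and scanning forward while suffixes still start with q. The function names and the use of a plain sorted list (rather than a real inverted index structure) are assumptions of the example.

```python
import bisect

def build_suffix_index(file_names):
    """Return a sorted list of (suffix, file_name) pairs."""
    entries = []
    for name in file_names:
        for start in range(len(name)):
            entries.append((name[start:], name))
    entries.sort()
    return entries

def find_files(index, q):
    """Return the sorted file names that contain q as a substring."""
    # (q,) sorts before every (suffix, name) pair whose suffix starts with q.
    start = bisect.bisect_left(index, (q,))
    found = set()
    for suffix, name in index[start:]:
        if not suffix.startswith(q):
            break                      # sorted order: no later entry can match
        found.add(name)
    return sorted(found)
```

The index holds one entry per character of every file name, so it grows quadratically in total characters stored if suffixes are kept as full strings; real implementations store positions instead.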

Search for Substring

In some information retrieval applications, any substring can be a search term.

Tries, using suffix trees, provide lexicographical indexes for all the substrings in a document or set of documents.

Tries: Search for Substring

Basic concept

The text is divided into unique semi-infinite strings, or sistrings. Each sistring has a starting position in the text, and continues to the right until it is unique.

The sistrings are stored in (the leaves of) a tree, the suffix tree. Common parts are stored only once.

Each sistring can be associated with a location within a document where the sistring occurs. Subtrees below a certain node represent all occurrences of the substring represented by that node.

Suffix trees have a size of the same order of magnitude as the input documents.

Tries: Suffix Tree

Example: suffix tree for the following words:

    begin  beginning  between  bread  break

    b
        e
            gin
                null
                ning
            tween
        rea
            d
            k

Tries: Sistrings

A binary example

    String:       01 100 100 010 111

    Sistrings:    1    01 100 100 010 111
                  2    11 001 000 101 11
                  3    10 010 001 011 1
                  4    00 100 010 111
                  5    01 000 101 11
                  6    10 001 011 1
                  7    00 010 111
                  8    00 101 11
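Generating sistrings is just taking the suffix that starts at each position. The sketch below takes the text to be 01100100010111, the bit string whose suffixes at positions 4 through 8 are 00 100 010 111, 01 000 101 11, 10 001 011 1, 00 010 111 and 00 101 11; the function name and the 1-based numbering of the result are choices made for the example.

```python
def sistrings(bits, count=None):
    """Return {position: sistring} for the first `count` positions (1-based).

    Spaces used for visual grouping are ignored. A true sistring is
    semi-infinite; here it simply runs to the end of the given text.
    """
    bits = bits.replace(" ", "")
    count = len(bits) if count is None else count
    return {i + 1: bits[i:] for i in range(count)}

TEXT_BITS = "01 100 100 010 111"
FIRST_EIGHT = sistrings(TEXT_BITS, 8)
```

FIRST_EIGHT[4] is "00100010111" and FIRST_EIGHT[8] is "0010111", matching the fourth and eighth sistrings of the example.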

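Tries: Lexical Ordering

    Sistring    Bits                     Unique prefix
    7           00 010 111               000
    4           00 100 010 111           00100
    8           00 101 11                00101
    5           01 000 101 11            010
    1           01 100 100 010 111       011
    6           10 001 011 1             1000
    3           10 010 001 011 1         1001
    2           11 001 000 101 11        11

The unique prefix is the shortest leading part of a sistring that distinguishes it from all the others (indicated in blue on the original slide).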
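The ordering 7 4 8 5 1 6 3 2 can be checked by sorting the sistrings as plain strings. This sketch again assumes the underlying text is 01100100010111 and considers only its first eight sistrings; the helper names are illustrative. It also computes, for one sistring, the shortest prefix that no other sistring shares.

```python
TEXT_BITS = "01100100010111"

def lexical_order(bits, count):
    """Return sistring numbers (1-based) sorted by their bit strings."""
    suffixes = {i + 1: bits[i:] for i in range(count)}
    return sorted(suffixes, key=lambda i: suffixes[i])

def unique_prefix(bits, count, number):
    """Shortest prefix of sistring `number` shared by no other sistring."""
    target = bits[number - 1:]
    others = [bits[j:] for j in range(count) if j != number - 1]
    for length in range(1, len(target) + 1):
        prefix = target[:length]
        if not any(other.startswith(prefix) for other in others):
            return prefix
    return target   # reached only if the sistring is a prefix of another
```

Sorting all suffixes of a text in this way is the same computation that produces a suffix array.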

Trie: Basic Concept

[Figure: binary trie with branches labeled 0 and 1 and leaves labeled with the sistring numbers 1 to 8.]
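A binary trie over sistrings can be sketched with nested dictionaries: each internal node maps "0" and "1" to a child, and each leaf holds a sistring number. The sketch assumes the text 01100100010111 and its first eight sistrings, stops each branch as soon as the sistring becomes unique, and assumes no sistring is a prefix of another; the names build_trie and find_leaf are made up for the example.

```python
TEXT_BITS = "01100100010111"

def build_trie(bits, count):
    """Build a binary trie whose leaves are 1-based sistring numbers."""
    suffixes = [bits[i:] for i in range(count)]
    root = {}
    for i, s in enumerate(suffixes):
        others = suffixes[:i] + suffixes[i + 1:]
        node = root
        for depth in range(1, len(s) + 1):
            bit = s[depth - 1]
            if not any(o.startswith(s[:depth]) for o in others):
                node[bit] = i + 1          # unique from here on: store a leaf
                break
            node = node.setdefault(bit, {})
        else:
            raise ValueError("a sistring is a prefix of another one")
    return root

def find_leaf(trie, query_bits):
    """Follow query bits from the root; return the leaf reached, or None."""
    node = trie
    for bit in query_bits:
        if not isinstance(node, dict):
            break                          # already at a leaf
        node = node.get(bit)
        if node is None:
            return None
    return node if isinstance(node, int) else None
```

With this construction, find_leaf(build_trie(TEXT_BITS, 8), "00101") reaches leaf 8. A leaf only identifies a candidate position; the remaining query bits still have to be compared against the text.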

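Patricia Tree

[Figure: Patricia tree for the same eight sistrings; internal nodes are labeled with bit numbers, branches with 0 and 1, and leaves with sistring numbers.]

Single-descendant nodes are eliminated. Each internal node stores the number of the bit to be tested at that node.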

YFA November 2008 (2nd Edition), May 2008
Adapted from cs.cornell.edu