Recuperação de Informação B (Information Retrieval B)
Chapter 6: Text and Multimedia Languages and Properties (Introduction, Metadata and Text)
Sections 6.1, 6.2, 6.3
November 01, 1999

Introduction
- Text
  - Main form of communicating knowledge.
- Document
  - Loosely defined; denotes a single unit of information.
  - Can be any physical unit:
    - a file
    - an email
    - a Web page

Introduction
- Document
  - Syntax and structure
  - Semantics
  - Information about itself

Introduction
- Document Syntax
  - Implicit, or expressed in a language (e.g., TeX).
  - Powerful languages: easier to parse, difficult to convert to other formats.
  - Open languages are better (interchange).
  - The semantics of natural-language text are not easy for a computer to understand.
  - Trend: languages that provide information on structure, format, and semantics while remaining readable by both humans and computers.

Introduction
- New applications are pushing formats in which information can be represented independently of style.
  - Style is defined by the author, but the reader may decide part of it.
  - Style can include the treatment of other media.

Metadata
- "Data about the data"
  - e.g., in a DBMS, the schema specifies the names of the relations, attributes, domains, etc.
- Descriptive Metadata
  - Author, source, length
  - Dublin Core Metadata Element Set
- Semantic Metadata
  - Characterizes the subject matter within the document contents
  - MEDLINE

Metadata
- MARC

    100  0020  1   $aHagler, Ronald.
    245  0074  14  $aThe bibliographic...
    250  0012      $a3rd. Ed.
    260  0052      $aChicago : $bALA, $c1997

Metadata
- Metadata information on Web documents
  - Cataloging, content rating, property rights, digital signatures
- New standard: Resource Description Framework (RDF)
  - Description of Web resources to facilitate automated processing of information
  - Nodes and attached attribute/value pairs
- Metadescription of non-textual objects
  - Keywords can be used to search the objects

Metadata
- RDF Example

    <RDF:RDF>
      <RDF:Description RDF:HREF="page.html">
        <DC:Creator>John Smith</DC:Creator>
        <DC:Title>John's Home Page</DC:Title>
      </RDF:Description>
    </RDF:RDF>

Metadata
- RDF Schema Example

Text
- Text coding in bits
  - EBCDIC, ASCII
    - Initially 7 bits; later, 8 bits
  - Unicode
    - 16 bits, to accommodate oriental languages

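The bit widths above can be checked with a small Python sketch. The codec names are assumptions chosen for illustration: `cp037` is one EBCDIC code page shipped with Python, `latin-1` stands in for an 8-bit extension of ASCII, and `utf-16-be` gives the 16-bit Unicode form without a byte-order mark. The helper name `encoded_bytes` is hypothetical.

```python
# Encode one sample character under each coding scheme and report its size.
samples = {
    "ascii": "A",        # 7-bit ASCII, stored in one byte
    "cp037": "A",        # an EBCDIC code page (assumed representative)
    "latin-1": "é",      # an 8-bit extension of ASCII (assumed representative)
    "utf-16-be": "語",   # one 16-bit Unicode code unit, big-endian, no BOM
}


def encoded_bytes(text, encoding):
    """Return the raw bytes of `text` under the named encoding."""
    return text.encode(encoding)


for encoding, char in samples.items():
    data = encoded_bytes(char, encoding)
    print(f"{encoding:10s} {char!r} -> 0x{data.hex()} ({len(data)} byte(s))")
```

The same letter "A" maps to different byte values under ASCII and EBCDIC, which is one reason an IR system has to know the coding before it can index a document.
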
Text
- Formats
  - No single format exists.
  - An IR system should retrieve information from different formats.
  - Past: IR systems converted the documents.
  - Today: IR systems use filters.

Text
- Formats
  - Formats for document interchange (RTF)
  - Formats for displaying (PDF, PostScript)
  - Formats for encoding email (MIME)
  - Compressed files
    - uuencode/uudecode, binhex

Text
- Information Theory
  - The amount of information is related to the distribution of symbols in the document.
  - Entropy: $E = -\sum_{i=1}^{\sigma} p_i \log_2 p_i$, where $p_i$ is the probability of the $i$-th of $\sigma$ symbols.
  - The definition of entropy depends on the probabilities of each symbol.
  - Text models are used to obtain those probabilities.

Text
- Example: Entropy
  - 001001011011 (six 0s and six 1s, so each symbol has probability 1/2 and the entropy is 1 bit per symbol)

Text
- Example: Entropy
  - 111111 (a single symbol with probability 1, so the entropy is 0)

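The two entropy examples can be reproduced with a minimal sketch. It assumes the simplest text model, in which each symbol's probability is its relative frequency in the string itself; the function name `entropy` is illustrative.

```python
from collections import Counter
from math import log2


def entropy(text):
    """Entropy in bits per symbol, using relative frequencies as probabilities."""
    counts = Counter(text)
    n = len(text)
    # sum of p * log2(1/p), which equals -sum of p * log2(p)
    return sum((c / n) * log2(n / c) for c in counts.values())


print(entropy("001001011011"))  # balanced 0s and 1s
print(entropy("111111"))        # only one symbol
```

A richer text model (for example, one that conditions on previous symbols) would give different probabilities and therefore a different entropy for the same string.
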
Text
- Modeling Natural Language
  - Symbols: separate words or belong to words.
  - Symbols are not uniformly distributed.
    - binomial model
  - Dependency on previous symbols
    - k-order Markovian model
  - We can take words as symbols.

Text
- Modeling Natural Language
  - Distribution of words inside documents
  - Zipf's Law: the $i$-th most frequent word appears $1/i^{\theta}$ times as often as the most frequent word.
  - Real data fits better with $\theta$ between 1.5 and 2.0.

Text
- Modeling Natural Language
  - Example: word distribution (Zipf's Law) with $\theta = 2$ and $V = 1000$
    - most frequent word: n = 300
    - 2nd most frequent: n = 76
    - 3rd most frequent: n = 33
    - 4th most frequent: n = 19

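As a rough check of the example, the sketch below applies the plain ratio form of Zipf's Law, where the $i$-th word occurs `top_frequency / i**theta` times. The function name and the choice of the unnormalized form are assumptions. This form gives 75 for the second word where the slide lists 76, so the slide may have used a normalized or rounded variant that is not recoverable here.

```python
def zipf_frequencies(top_frequency, theta, ranks):
    """Expected occurrences of the 1st..ranks-th most frequent words."""
    return [top_frequency / (i ** theta) for i in range(1, ranks + 1)]


for rank, freq in enumerate(zipf_frequencies(300, 2, 4), start=1):
    print(f"rank {rank}: about {freq:.2f} occurrences")
```

The steep drop after the first few ranks is what makes a small set of very frequent words account for a large share of any text.
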
Text
- Modeling Natural Language
  - Skewed distribution: stopwords
  - Distribution of words in the documents
    - binomial distribution
    - Poisson distribution

Text
- Modeling Natural Language
  - Number of distinct words
  - Heaps' Law: $V = K n^{\beta}$, where $n$ is the text size in words and $V$ is the vocabulary size.
  - The set of distinct words is bounded by a constant, but that bound is very high.

Text
- Modeling Natural Language
  - Heaps' Law example
    - $K$ is between 10 and 100; $\beta$ is less than 1.
    - Example: n = 400000, $\beta$ = 0.5
      - K = 25, V = 15811
      - K = 35, V = 22135

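The two vocabulary sizes in this example follow directly from $V = K n^{\beta}$. A minimal sketch, with an illustrative function name, and with the result truncated to an integer as the slide's figures appear to be:

```python
def heaps_vocabulary(n, K, beta):
    """Estimated number of distinct words in a text of n words (Heaps' Law)."""
    return K * n ** beta


for K in (25, 35):
    print(f"K={K}: V is about {int(heaps_vocabulary(400000, K, 0.5))}")
```

Because $\beta < 1$, the vocabulary grows sublinearly: quadrupling the text size with $\beta = 0.5$ only doubles the estimated vocabulary.
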
Text
- Modeling Natural Language
  - Length of the words
    - Defines the total space needed for the vocabulary.
  - Heaps' Law: length increases logarithmically with text size.
  - In practice, a finite-state model is used:
    - a space has probability p = 0.2
    - a space cannot appear twice in succession
    - there are 26 letters

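One way to read the finite-state model is sketched below. The interpretation is an assumption, not something the slide spells out: the character after a space is always a letter, and every later character is a space with probability `p_space`. Under that reading, word length is geometrically distributed with mean `1 / p_space`. Both function names are hypothetical.

```python
def word_length_pmf(length, p_space=0.2):
    """P(a word has exactly `length` letters) under the assumed model."""
    if length < 1:
        return 0.0  # two spaces in a row are not allowed, so no empty words
    # (length - 1) further letters after the forced first one, then a space
    return (1 - p_space) ** (length - 1) * p_space


def mean_word_length(p_space=0.2):
    """Mean of the geometric distribution defined by word_length_pmf."""
    return 1 / p_space


print(mean_word_length())
print([round(word_length_pmf(k), 4) for k in range(1, 6)])
```

With p = 0.2 this reading gives an average word of five letters, which is in the range usually quoted for English text.
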
Text
- Similarity Models
  - Distance function
    - Should be symmetric and satisfy the triangle inequality.
  - Hamming distance
    - Number of positions that have different characters.
    - Example: reverse / receive

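A minimal sketch of the Hamming distance, applied to the slide's pair of words. The function name is illustrative, and raising an error on unequal lengths is a design choice made here, since the measure is only defined for strings of the same length.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


print(hamming_distance("reverse", "receive"))
```

The function is symmetric by construction, since comparing position by position does not depend on argument order.
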
Text
- Similarity Models
  - Edit (Levenshtein) distance
    - Minimum number of operations (character insertions, deletions, and substitutions) needed to make two strings equal.
    - Example: survey / surgery
    - Superior for modeling syntactic errors.
    - Extensions: weights, transpositions, etc.

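The edit distance is usually computed with dynamic programming. The sketch below is one common formulation with unit costs, keeping only two rows of the table; it does not cover the weighted or transposition extensions, and the function name is illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance with unit-cost insert, delete, and substitute."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


print(edit_distance("survey", "surgery"))
```

For the slide's pair, one substitution (v to g) and one insertion (r) are enough, so the distance is 2.
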
Text
- Similarity Models
  - Longest Common Subsequence (LCS)
    - survey / surgery: the LCS is "surey"
  - Documents:
    - lines as symbols (diff in Unix)
    - time consuming
    - similar lines
  - Fingerprints
  - Visual tools

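The LCS of the slide's pair can be recovered with the standard dynamic-programming table followed by a backward walk. This is a sketch for short strings; it uses quadratic time and space, which is part of why line-level comparison of whole documents is time consuming. The function name is illustrative.

```python
def lcs(a, b):
    """Return one longest common subsequence of strings a and b."""
    # table[i][j] is the LCS length of a[:i] and b[:j]
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, start=1):
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Walk back from the bottom-right corner to collect the characters.
    out = []
    i, j = len(a), len(b)
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


print(lcs("survey", "surgery"))
```

Treating each line of a file as one symbol, instead of each character, turns the same idea into a line-by-line comparison in the style of diff.
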
Conclusions
- Text is the main form of communicating knowledge.
- Documents have syntax, structure, and semantics.
- Metadata: information about data
- Formats of text
- Modeling Natural Language
  - Entropy
  - Distribution of symbols
- Similarity