Special Topics in Computer Science The Art of






















- Slides: 22
Special Topics in Computer Science The Art of Information Retrieval Chapter 6: Text and Multimedia Languages and Properties Alexander Gelbukh www. Gelbukh. com
Previous chapter: Conclusions Query operations: q Relevance feedback o Simple, understandable o Needs user attention o Term re-weighting q Local analysis for query expansion o Co-occurrences in the retrieved docs o Usually gives better results than global analysis o Computationally expensive q Global analysis o Worse results. What is good for collection is not for a query o Linguistic methods, dictionaries, ontologies, stemming, . . . 2
Previous chapter: Trends and research topics q Interactive interfaces o Graphical, 2 D or 3 D q Refining global analysis techniques q Application of linguistics methods. Stemming. Ontologies q Local analysis for the Web (now too expensive) q Combine the tree techniques (feedback, local, global) 3
Anatomy of a document. . . q We search for documents q What is a document? 4
Characteristics of a document q Syntax is a device that “plays” the document producing semantics (kind of: presentation) q Like CD drive plays CD to produce music q Knowing Korean + paper w/glyphs meaning 5
. . . Anatomy of a document q Queries are conditions on semantics/presentation, not on (binary? ) data of the document q Thus need to know syntax o Example: search in PS or PDF q How to describe formally? 6
Metadata q Info about the organization of data o Data about the data q Descriptive vs. Semantic metadata o Descriptive: about creation: author, date, . . . o Semantic: about meaning: keywords, subject codes, . . . § Ontologies o Others: who and how to use. E. g. : adult, confident, signature q Standards (many) o Dublin Core Metadata Element Set: 15 fields. Descriptive. o Machine Readable Catalog Record (MARC): bibliographic q WEB – very important o Many projects on Web ontologies. Semantic Web. 7
Text q q Encoding. ASCII-7, 8. UNICODE: oriental Format. Binary vs. ASCII (better). DOC, RTF, PDF, PS Compression. ZIP, ARJ Binary in ASCII: uuencode To predict behavior of tools and systems, need to model text q Entropy: the limit of compression, degree of chaos q Statistics of the letters and words o Very skewed 8
Zipf law 9
Zipf law, etc. q q q i-th most frequent word appears k/i times, = 1. 5 – 2 Mandelbrot form: k/(c+i) 50% of text are few hundred words Most of them are stopwords: the, of, and, a, to, in. . . Not indexed smaller indices q Distribution of words by docs o p, k depend on collection and word 10
Heaps’ law 11
Heaps’ law, etc. q # of distinct words (size of vocabulary) = Kn , o = 0. 4. . . 0. 6 square root q Applies to collections q To WWW q Average length of word. English: o 4. 8. . . 5. 3 letters per word, average by text o 6. . . 7 without stopwords o 8. . . 9 average by vocabulary 12
Similarity between strings q q symmetric; triangle: dist (a, c) dist (a, b) + dist (b, c) Hamming: # of different positions. Also for sets. Soundex: phonetic similarity Levenshtein: min # insertions, deletions, substitutions o dist (survey, surgery) = 2 o A very good measure q Longest common subsequence: survey, surgery surey q Various metrics to compare whole docs o E. g. , consider strings as symbols, or similarity of strings, etc. 13
Markup languages “Our documents do not belong to us but to Bill Gates!” q q Extra textual syntax to describe formatting, structure, . . . Marks are called tags. Initial and ending tags surround the marked text. Standard metalanguage: SGML (Standard Generalized Markup Language) o XML (e. Xtensible), its subset: new metalanguage for Web o HTML is an instance of SGML 14
SGML q Provides rules for defining tags q A document consists of: o Definitions of tags § Document Type Declaration, DTD § Informal comments or an additional description o Text with tags q Tags: <tag>text</tag> q Mostly defines semantics, not printing format o Defined in other languages 15
HTML q 1992; 4. 0: 1997 q Instance of SGML o Exists DTD, usually not used q Also does not define (much of) formatting. Thus: q Cascade Style Sheets (CSS) o define aspects of formatting o can be combined (cascaded) o not well supported by browsers q Does NOT (unlike generic SGML ( too expensive)) o allow to specify new tags o support nesting structures o support validity checks 16
XML (e. Xtensible. . . ) q More flexible than HTML, simpler than SGML q Simplified subset of SGML o Much simpler in implementation q Allows for human- and machine-readable markup o Good for development of Web docs o Allow to do things that now are done with Java scripts q Using DTD is optional, parser can discover tags q Extensible Style sheet Language (like CSS in HTML) o Like macros in a word processor q Extensible Linking Language 17
Uses of XML q Math. ML: Mathematical Markup Language o Not only presentation but also meaning of expressions! q SMIL: Synchronized Multimedia Integration Language o Declarative language to specify positions and timing q Resource Description Format o Metadata for XML Trend: HTML evolutions to model and describe the structure of data, not presentation details 18
Multimedia q Text, sound, images, video q Image formats. BMP. Compression: o GIF. Good for few colors o JPG. Lossy compression. Parametric: can be controlled o TIFF is used for exchange; can contain metadata q Moving images: o MPEG: Moving Pictures Expert Group. Encodes changes q Textual images. Compression. Retrieval: o Metadata, keywords o OCR. Many typos; keyword search should be approximate o Treat as a sequence of images, convert query similarly 19
Taxonomy of Web languages 20
Conclusions q Modeling of text helps predict behavior of systems o Zipf law, Heaps’ law q Describing formally the structure of documents allows to treat a part of their meaning automatically, e. g. , search q Languages to describe document syntax o SGML, too expensive o HTML, too simple o XML, good combination 21
The class of Oct 30 is cancelled Thank you! Till November 6 22