  • Slides: 23
Content Types: Text and Metadata

Introduction
Text documents come in many forms
  – Article (news, conference, journal, etc.)
  – Email, memo, …
  – Book, manual, manuscript, transcript, …
  – Any part of one of the above
Syntax can express
  – Structure
  – Presentation style
  – Semantics (e.g. software code)

Metadata – data about data
Descriptive metadata
  – External to meaning of document
  – Author, publication date, document source, document length, document genre, file type, bits per second, frame rate, etc.
Semantic metadata
  – Characterizes semantic content of document
  – Library of Congress (LoC) subject headings, keywords, subject headings from ontologies (e.g. MeSH), etc.

Metadata Formats
Machine Readable Cataloging Record (MARC)
  – Used by most libraries
  – Fields include title, author, etc.
Resource Description Framework (RDF)
  – Used for Web resources
  – Node and attribute/value pairs
  – Node ID is any Uniform Resource Identifier (URI), which could be a URL

Metadata Sets
Dublin Core Metadata Elements
  1. Contributor – entities contributing to the content
  2. Coverage – extent or scope of content (spatial area, temporal period, …)
  3. Creator – entity primarily responsible for making the content
  4. Date – date associated with event (e.g. publication) for resource
  5. Description – abstract, table of contents, …
  6. Format – media (file) type, dimensions (size, duration), hardware needed
  7. Identifier – unique identifier
  8. Language – language of content
  9. Publisher – entity responsible for making resource available
  10. Relation – reference to related resource(s)
  11. Rights – information about rights held in/over resource
  12. Source – resource from which content is derived
  13. Subject – keywords, key phrases, classification code, etc.
  14. Title – name of the resource
  15. Type – nature or genre of content
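As a rough illustration of how a few of the elements listed above might be attached to a resource, the sketch below builds a small XML record with Python's standard library. The `record` wrapper element, the `dublin_core_record` helper, and all field values other than the title are assumptions made up for this example; only the element names and the Dublin Core element namespace come from the standard.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core element set.
DC_NS = "http://purl.org/dc/elements/1.1/"


def dublin_core_record(fields):
    """Build an XML element with one child per Dublin Core field.

    `fields` maps lowercase element names (e.g. "title") to text values.
    The <record> wrapper is a hypothetical container, not part of Dublin Core.
    """
    ET.register_namespace("dc", DC_NS)
    record = ET.Element("record")
    for name, value in fields.items():
        child = ET.SubElement(record, f"{{{DC_NS}}}{name}")
        child.text = value
    return record


# Illustrative values only.
example = dublin_core_record({
    "title": "Content Types: Text and Metadata",
    "type": "Presentation",
    "language": "en",
    "format": "text/plain",
})
xml_text = ET.tostring(example, encoding="unicode")
print(xml_text)
```

Descriptive elements (format, language) and semantic ones (subject) can be mixed freely in the same record, since each is just a named attribute/value pair.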

Text Formats
Coding schemes
  – EBCDIC (8 bit, one of the first coding schemes)
  – ASCII (initially 7 bit, extended to 8 bit)
  – Unicode (originally 16 bit, for large alphabets)
Additional formats
  – RTF (format-oriented document exchange)
  – PDF and PostScript (display-oriented representation)
  – Multipurpose Internet Mail Extensions (MIME) (multiple character sets, languages, media)
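The practical difference between coding schemes can be seen by encoding the same string several ways. The sketch below is a minimal illustration using codecs that ship with Python; `cp037` is one EBCDIC code page among several, chosen here only as an example, and `encoded_sizes` is a helper name invented for this sketch.

```python
def encoded_sizes(text):
    """Return the number of bytes needed to store `text` in several encodings.

    A value of None means the scheme cannot represent some character in `text`.
    """
    sizes = {}
    for codec in ("cp037", "ascii", "utf-8", "utf-16-le"):
        try:
            sizes[codec] = len(text.encode(codec))
        except UnicodeEncodeError:
            sizes[codec] = None
    return sizes


# The same letter has different byte values in EBCDIC and ASCII.
print("A".encode("cp037"), "A".encode("ascii"))

print(encoded_sizes("color"))  # plain English text fits every scheme
print(encoded_sizes("日本"))    # needs a Unicode encoding
```

Text limited to the English alphabet costs one byte per character in the 8-bit schemes and two in UTF-16, while characters from larger alphabets are only representable in the Unicode encodings.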

Information Theory
How can we predict the information value of the components of a document?
Entropy attempts to model information content (information uncertainty)
  $E = -\sum_{i} p_i \log_2 p_i$
  – The sum runs over all symbols $i$ in the alphabet
  – $p_i$ is the probability of symbol $i$ (number of occurrences of the symbol divided by the total number of symbols in the text)
Need a text model for real language
Also important for compression, since E acts as a limit on how much a text can be compressed
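The entropy formula can be computed directly from symbol counts. The following is a minimal sketch that treats each character as a symbol and estimates each probability from the text itself; the function name is an assumption for illustration.

```python
import math
from collections import Counter


def entropy_bits_per_symbol(text):
    """Estimate entropy E = -sum(p_i * log2(p_i)) over the characters of `text`.

    Each p_i is the relative frequency of a character in `text`.
    """
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


print(entropy_bits_per_symbol("abab"))  # two equally likely symbols
print(entropy_bits_per_symbol("abcd"))  # four equally likely symbols
```

Under this zero-order estimate, a text of n symbols cannot be coded symbol by symbol in fewer than roughly n times E bits, which is the sense in which E bounds compression.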

Modeling Character Strings
Symbols in natural language (NL) are not evenly distributed
  – Some symbols are not part of words (often used for syntax)
  – Symbols in words are not evenly distributed
Models
  – Binomial model uses the distribution of symbols in the language
    • But previous symbols influence probabilities of later symbols
    • (What letter will appear after a q?)
  – Finite-context or Markovian models are used for this dependency
    • k-order, where k is the number of previous characters taken into account by the model
    • Thus, the binomial model is a 0-order model
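The difference between a 0-order and a 1-order model shows up on the "letter after q" question. The sketch below counts followers of each k-character context; the helper names and the sample sentence are made up for illustration.

```python
from collections import Counter, defaultdict


def train_markov(text, k):
    """Count which character follows each k-character context in `text`.

    With k = 0 every character shares the empty context, which reduces
    to plain symbol frequencies (the 0-order model).
    """
    model = defaultdict(Counter)
    for i in range(k, len(text)):
        model[text[i - k:i]][text[i]] += 1
    return model


def next_char_probability(model, context, char):
    """Estimated probability that `char` follows `context`."""
    followers = model.get(context)
    if not followers:
        return 0.0
    return followers[char] / sum(followers.values())


sample = "the queen quietly quit the quiz"
order0 = train_markov(sample, 0)
order1 = train_markov(sample, 1)

print(next_char_probability(order0, "", "u"))   # u without any context
print(next_char_probability(order1, "q", "u"))  # u given the previous q
```

In this sample the 1-order model assigns probability 1 to `u` after `q`, while the 0-order model only knows how common `u` is overall.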

Word Distribution in Documents
How frequent are words within documents?
Zipf's Law
  – Frequency of the i-th most frequent word is $1/i^{\theta}$ times the frequency of the most frequent word
  – The value of $\theta$ depends on the text ($\theta = 1$ is the simplest form, where the sum of the $1/i$ terms grows logarithmically with vocabulary size)
  – $\theta$ values of 1.5 to 2.0 best model real texts
In practice, a few hundred words make up 50% of most texts
  – Frequent words provide less information
  – Thus, many search strategies involve ignoring stopwords (a, an, the, is, of, by, …)
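Zipf's Law can be explored numerically. The sketch below generates the frequencies the law predicts for a given top frequency and exponent, then counts how many top-ranked words cover a chosen share of all occurrences. The function names and input numbers are illustrative assumptions, not measurements from any corpus.

```python
def zipf_frequencies(top_frequency, vocabulary_size, theta):
    """Predicted frequency of each rank: top_frequency / rank**theta."""
    return [top_frequency / (rank ** theta)
            for rank in range(1, vocabulary_size + 1)]


def words_needed_for_share(frequencies, share):
    """Number of top-ranked words whose counts reach `share` of all occurrences.

    `frequencies` must be sorted from most to least frequent.
    """
    total = sum(frequencies)
    running = 0.0
    for rank, freq in enumerate(frequencies, start=1):
        running += freq
        if running >= share * total:
            return rank
    return len(frequencies)


freqs = zipf_frequencies(top_frequency=1000, vocabulary_size=4, theta=2.0)
print(freqs)
print(words_needed_for_share(freqs, 0.5))
```

Running `words_needed_for_share` on word counts from a real collection is one way to check how concentrated its vocabulary is before deciding on a stopword list.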

Word Distribution in Collections
Simplest to assume a uniform distribution of words across documents
  – But this is not true
Better models are built on negative binomial distributions or Poisson distributions

Vocabulary Size for Documents and Collections
Heaps' Law
  – Vocabulary size (V) grows with number of words (n)
    • $V = K n^{\beta}$
    • Experimentally,
      – K is between 10 and 100
      – $\beta$ is between 0.4 and 0.6
  – So vocabulary grows roughly in proportion to the square root of the size of the document or collection in words
  – Works best for large documents and collections
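A small sketch of Heaps' Law, alongside a helper that measures actual vocabulary growth so the two can be compared. The default values K = 50 and beta = 0.5 are arbitrary picks from inside the ranges quoted above, and both function names are assumptions for this example.

```python
def heaps_vocabulary(n_words, k=50.0, beta=0.5):
    """Predicted vocabulary size V = K * n**beta for a text of n_words words."""
    return k * n_words ** beta


def observed_vocabulary_growth(tokens):
    """Number of distinct tokens seen after each position in `tokens`."""
    seen = set()
    growth = []
    for token in tokens:
        seen.add(token)
        growth.append(len(seen))
    return growth


print(heaps_vocabulary(10_000))
print(heaps_vocabulary(40_000))  # four times the text, about twice the vocabulary
print(observed_vocabulary_growth("a b a c b".split()))
```

With beta near 0.5, quadrupling the amount of text only doubles the predicted vocabulary, which is the square-root behavior noted above.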

String Similarity Models
Similarity is measured by a distance function
Hamming distance – number of positions at which two strings of equal length differ
Levenshtein distance – minimum number of insertions, deletions, and substitutions needed to make strings equal
  – color to colour is 1
  – survey to surgery is 2
Can be extended to documents
  – UNIX diff treats each line as a character
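Both distances are short to implement. The sketch below uses the standard dynamic-programming formulation for Levenshtein distance, keeping only one previous row of the table. Because it works on any sequence, passing lists of lines instead of strings gives a line-level distance in the spirit of the diff remark, though this is not how any particular diff tool is implemented.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # delete item_a
                               current[j - 1] + 1,       # insert item_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


print(levenshtein("color", "colour"))
print(levenshtein("survey", "surgery"))
print(levenshtein(["line 1", "line 3"], ["line 1", "line 2", "line 3"]))
```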

Content Types: Markup and Multimedia

Introduction
Markup languages use extra textual syntax to encode:
  – Formatting / display information
  – Structure information
  – Descriptive metadata
  – Semantic metadata
Marks are often called tags
  – The act of adding markup is called tagging
  – Most markup languages use initial and ending tags surrounding the marked text

Standard Generalized Markup Language (SGML)
Metalanguage for markup
  – Includes rules for defining a markup language
  – Use of SGML includes
    • Description of structure of markup
    • Text marked with tags
Document Type Definition (DTD)
  – Describes and names tags and how they are related
  – Comments used to express interpretation of tags (meaning, presentation, …)

SGML DTD Example
<!-- SGML DTD for electronic messages -->
<!ELEMENT e-mail          - - (prolog, contents) >
<!ELEMENT prolog          - - (sender, address+, subject?, Cc*) >
<!ELEMENT (sender | address | subject | Cc) - O (#PCDATA) >
<!ELEMENT contents        - - (par | image | audio)+ >
<!ELEMENT par             - O (ref | #PCDATA)+ >
<!ELEMENT ref             - O EMPTY >
<!ELEMENT (image | audio) - - (#NDATA) >
<!ATTLIST e-mail
          id         ID                 #REQUIRED
          date_sent  DATE               #REQUIRED
          status     (secret | public)  public >
<!ATTLIST ref
          id         IDREF              #REQUIRED >
<!ATTLIST (image | audio)
          id         ID                 #REQUIRED >

SGML Example
<!DOCTYPE e-mail SYSTEM "e-mail.dtd">
<e-mail id=94108rby date_sent=02101998>
<prolog>
<sender> Pablo Neruda</sender>
<address> Federico Garcia Lorca</address>
<address> Ernest Hemingway</address>
<subject> Picture of my house in Isla
<Cc> Gabriel Garcia Marquez</Cc>
</prolog>
<contents>
<par> Here are two photos. One is of the view (photo <ref idref=F2>). </par>
<image id=F1> "photo1.gif" </image>
<image id=F2> "photo2.jpg" </image>
</contents>
</e-mail>

SGML Characteristics
DTD provides the ability to determine whether a given document conforms to the declared structure (is valid)
SGML generally does not specify presentation/appearance
Output specification standards:
  – DSSSL (Document Style Semantics and Specification Language)
  – FOSI (Formatting Output Specification Instance)

HyperText Markup Language (HTML)
Based on SGML
  – HTML DTD not explicitly referenced by documents
HTML documents can have documents embedded within them
  – Images or audio
  – Frames with other HTML documents
When programs are included, it is referred to as Dynamic HTML
Strict HTML includes only non-presentational markup
  – Cascading Style Sheets (CSS) used to define presentation
In reality, presentational and structural markup are blended by HTML authoring applications

(Original) HTML Limitations
In contrast to SGML:
  – Users cannot specify their own tags or attributes
  – No support for nested structures that can represent database schemas or object-oriented hierarchies
  – No support for validation of document by consuming applications

eXtensible Markup Language (XML)
XML is a simplified subset of SGML
  – XML is a meta-language
  – XML designed for semantic markup that is both human and machine readable
  – No DTD is required
  – All tags must be closed
Extensible Stylesheet Language (XSL)
  – XML equivalent of CSS
  – Can be used to convert XML into HTML and CSS
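The "no DTD required, all tags must be closed" points can be checked with any XML parser. The sketch below uses Python's standard `xml.etree.ElementTree`, which parses without consulting a DTD and rejects input that is not well-formed. The `is_well_formed` helper and the sample strings are assumptions for illustration, loosely modeled on the e-mail example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if `xml_text` parses as XML, False otherwise.

    No DTD is involved: this checks syntax only, not validity.
    """
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


closed = "<e-mail><subject>Photos</subject></e-mail>"
unclosed = "<e-mail><subject>Photos</e-mail>"  # <subject> is never closed

print(is_well_formed(closed))
print(is_well_formed(unclosed))
print(is_well_formed('<image id="F1"/>'))  # empty element with quoted attribute
print(is_well_formed("<image id=F1/>"))    # SGML-style unquoted attribute
```

Omitted end tags and unquoted attribute values, both permitted by the SGML example's DTD, are syntax errors in XML.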

Multimedia
Lots of data file formats for non-textual data
  – Images
    • BMP, GIF, JPEG (JPG), TIFF
  – Audio
    • AU, MIDI, WAVE, MP3
  – Video
    • MPEG, AVI, QuickTime
  – Graphics / Virtual Environments
    • CGM, VRML, OpenGL

Audio and Video
Data files often have:
  – Header
    • Indicates time granularity, number of channels, bits per channel
    • Somewhat like a DTD
  – Data
    • The signal
Data may be compressed
  – Data may be in frequency domain rather than time domain
  – Data may be encoded as sequence of differences between consecutive time segments
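Encoding a signal as differences between consecutive samples can be sketched in a few lines. This is a generic illustration of the idea with made-up sample values, not the scheme used by any specific audio or video format.

```python
def delta_encode(samples):
    """Keep the first sample, then store each sample as a difference from the previous one."""
    if not samples:
        return []
    deltas = [samples[0]]
    for previous, current in zip(samples, samples[1:]):
        deltas.append(current - previous)
    return deltas


def delta_decode(deltas):
    """Rebuild the original samples by accumulating the differences."""
    samples = []
    total = 0
    for delta in deltas:
        total += delta
        samples.append(total)
    return samples


signal = [100, 102, 103, 103, 101]  # illustrative sample values
encoded = delta_encode(signal)
print(encoded)
print(delta_decode(encoded) == signal)
```

When neighboring samples are close in value, the differences are small numbers that need fewer bits than the raw samples, which is where the compression comes from.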