WMES 3103 : INFORMATION RETRIEVAL TEXT AND MULTIMEDIA LANGUAGES AND PROPERTIES


INTRODUCTION
• Text is the main form of communicating data and information.
• Text is also supplemented with multimedia elements to make the contents of an information retrieval system (IRS) more attractive and interactive.
• A website that combines text and multimedia is likely to attract more visitors than one that is text-based only.
• In an IRS, text and multimedia are represented by means of special languages.

Metadata
• A new concept in information: metadata
• Information about the arrangement of the data, the data domain and the relationship between the two
• "Data about data"
• Two types: descriptive and semantic

• Descriptive metadata – metadata that describes the document, or one unit of information, itself
• Commonly used descriptive metadata:
  – Authors
  – Date of publication
  – Source of publication
  – Length of document
  – Type of document

Metadata
• Semantic metadata – characterizes the subject matter that can be obtained from the contents of the document:
  – Subject headings
  – Keywords
  – LC (Library of Congress) code
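
The two kinds of metadata listed above can be kept apart in a record structure. The following is only a minimal sketch: the class names, field names and sample values are invented for illustration and do not come from the lecture or from any cataloguing standard.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class DescriptiveMetadata:
    """Facts about the document itself, not about its subject."""
    authors: List[str]
    publication_date: date
    source: str
    length_pages: int
    document_type: str


@dataclass
class SemanticMetadata:
    """Subject information derived from the document's contents."""
    subject_headings: List[str] = field(default_factory=list)
    keywords: List[str] = field(default_factory=list)
    lc_code: str = ""


@dataclass
class DocumentRecord:
    title: str
    descriptive: DescriptiveMetadata
    semantic: SemanticMetadata


def matches_keyword(record: DocumentRecord, term: str) -> bool:
    """Case-insensitive lookup against the semantic keywords only."""
    return term.lower() in (k.lower() for k in record.semantic.keywords)


# Invented sample record (all values are placeholders).
record = DocumentRecord(
    title="Example lecture notes",
    descriptive=DescriptiveMetadata(
        authors=["A. Author"],
        publication_date=date(2000, 1, 1),
        source="Example University",
        length_pages=17,
        document_type="slides",
    ),
    semantic=SemanticMetadata(
        subject_headings=["Information retrieval"],
        keywords=["metadata", "markup"],
        lc_code="Z699",  # illustrative value only
    ),
)

print(matches_keyword(record, "Metadata"))
```

Keeping the two groups in separate classes mirrors the distinction in the slides: a query on author or date touches only the descriptive part, while a subject search touches only the semantic part.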

TEXT
• With computers, text must be coded into binary digits.
• The first coding schemes were EBCDIC (8 bits per symbol) and ASCII (7 bits per symbol).
• ASCII was later extended to 8 bits to accommodate other languages, accents and diacritical marks.
• Oriental languages need far more symbols than 8 bits allow – Unicode, originally designed as a 16-bit code, covers them.
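
To make the bit counts above concrete, here is a small sketch that encodes a few strings with Python's built-in codecs and reports how many bytes each encoding needs. The helper name `encoded_sizes` and the sample strings are assumptions chosen for illustration. `latin-1` stands in for "8-bit extended ASCII" and `utf-16-be` for a 16-bit Unicode encoding.

```python
from typing import Dict, Optional

ENCODINGS = ("ascii", "latin-1", "utf-8", "utf-16-be")


def encoded_sizes(text: str) -> Dict[str, Optional[int]]:
    """Return the byte length of `text` in each encoding.

    None means the encoding cannot represent the text at all.
    """
    sizes: Dict[str, Optional[int]] = {}
    for name in ENCODINGS:
        try:
            sizes[name] = len(text.encode(name))
        except UnicodeEncodeError:
            sizes[name] = None
    return sizes


# The letter "A" fits in 7 bits, as in the original ASCII code.
print(format(ord("A"), "07b"))

for sample in ("text", "café", "漢字"):
    print(sample, encoded_sizes(sample))
```

Plain English text fits in ASCII, the accented "é" needs an 8-bit code or Unicode, and the two CJK characters can only be represented by the Unicode encodings.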

TEXT Formats
• There is no single format for a text document.
• A good IRS should be able to retrieve information from any format.
• Initially, an IRS would convert each document to an internal format, but this had many disadvantages.
• Now, many formats have been developed for document interchange.

TEXT
• RTF – Rich Text Format, used for word processing
• PDF – Portable Document Format, for displaying and printing documents
• PostScript – powerful programming language for drawing
• MIME – Multipurpose Internet Mail Extensions, used to encode e-mail
• Files are often compressed – Compress (Unix), ARJ (PCs), ZIP
• Binary files can be converted to ASCII text – uuencode/uudecode, BinHex
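
The last two bullets (compression, then conversion of binary data to ASCII text) can be sketched with Python's standard library. This is an illustration under stated assumptions, not the tools named on the slide: `zlib` stands in for Compress/ARJ/ZIP, and Base64 (the binary-to-text encoding used by MIME) stands in for uuencode/BinHex. The function names `pack` and `unpack` are invented.

```python
import base64
import zlib


def pack(data: bytes) -> str:
    """Compress binary data, then encode it as printable ASCII text."""
    compressed = zlib.compress(data)
    return base64.b64encode(compressed).decode("ascii")


def unpack(text: str) -> bytes:
    """Reverse of pack(): decode the ASCII text, then decompress."""
    return zlib.decompress(base64.b64decode(text))


original = b"information retrieval " * 50  # repetitive, so it compresses well
armoured = pack(original)

print(len(original), len(armoured), armoured.isascii())
print(unpack(armoured) == original)
```

The order matters: compressing first keeps the payload small, and the text encoding is applied last so that the result can travel through channels that only accept ASCII, such as classic e-mail.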

MARKUP LANGUAGES
• Markup = extra textual syntax that can be used to describe formatting actions, structure information, text semantics, attributes, etc.
• Formal markup languages are more structured.
• Marks = tags – an initial and an ending tag surround the marked text.
• Standard metalanguage = SGML
• New metalanguage for the Web = XML (eXtensible Markup Language), a subset of SGML
• Most popular markup language used for the Web = HTML (HyperText Markup Language)
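
As a small illustration of tags surrounding the marked text, the sketch below parses a short XML fragment with Python's standard `xml.etree.ElementTree` module. The element and attribute names (`lecture`, `topic`, `code`, `level`) are made up for this example and are not part of any standard vocabulary.

```python
import xml.etree.ElementTree as ET

# Each piece of text is enclosed by an initial tag and an ending tag;
# attributes carry extra information about the element.
DOCUMENT = """<lecture code="WMES 3103">
  <title>Text and Multimedia Languages and Properties</title>
  <topic level="1">Metadata</topic>
  <topic level="1">Markup languages</topic>
</lecture>"""

root = ET.fromstring(DOCUMENT)


def topics(element):
    """Return the text of every <topic> child of `element`."""
    return [t.text for t in element.findall("topic")]


print(root.tag, root.attrib["code"])
print(root.findtext("title"))
print(topics(root))
```

Because the structure is explicit in the markup, a retrieval system can index the title separately from the topics instead of treating the document as one undifferentiated string.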

MULTIMEDIA
• Applications that handle different types of digital data originating from distinct types of media
• Text, sound, images, video
• The digital data are distinct and differ in volume, format and processing requirements
• Different formats are necessary for storing each type of media

MULTIMEDIA
• Different formats are commonly used on the Web and in digital libraries:
  – Images
  – Audio
  – Moving images
  – Textual images
  – Graphics and virtual reality

IMAGES
• XBM, BMP, PCX – direct representation of a bitmapped (or pixel-based) image
• GIF (Graphics Interchange Format) – includes compression; good for black-and-white images or images with a small number of colours or grey levels (up to 256)
• JPEG (Joint Photographic Experts Group) – includes compression
• TIFF (Tagged Image File Format) – used to exchange documents between different applications and different computer platforms
• TGA (Truevision Targa image file) – associated with video game boards
• Various other image formats
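
Since an IRS has to cope with several image formats, it often needs to recognise a format from the file contents rather than from the file name. The sketch below checks the well-known leading "magic" bytes of GIF, JPEG, TIFF and BMP files. The function name `sniff_image_format` is invented, and the table is deliberately incomplete: XBM, PCX and TGA are left out because they are not reliably identified by a simple leading signature.

```python
from typing import Optional

# (leading bytes, format name)
SIGNATURES = (
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"II*\x00", "TIFF"),  # little-endian byte order
    (b"MM\x00*", "TIFF"),  # big-endian byte order
    (b"BM", "BMP"),
)


def sniff_image_format(header: bytes) -> Optional[str]:
    """Guess the image format from the first bytes of a file."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return None


print(sniff_image_format(b"GIF89a\x01\x00\x01\x00"))
print(sniff_image_format(b"\xff\xd8\xff\xe0\x00\x10JFIF"))
print(sniff_image_format(b"plain text"))
```

In practice the header would be read from the file, for example with `open(path, "rb").read(8)`; the sketch takes bytes directly so that it runs without any files on disk.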

AUDIO
• Audio must be digitized before storage
• AU, MIDI (standard format to interchange music between electronic instruments and computers), WAVE – for small pieces of digital audio
• Audio libraries – RealAudio or CD formats

Animation or moving pictures
• MPEG (Moving Picture Experts Group) – related to JPEG
• Others – AVI, FLI, QuickTime
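
"Digitized before storage" means sampling the signal at regular intervals and quantizing each sample to a fixed number of bits. The sketch below generates a sine tone, quantizes it to 16-bit samples and stores it in WAVE format in memory, using Python's standard `wave` module. The function name and the parameter values (440 Hz, 8000 samples per second) are arbitrary choices for illustration.

```python
import io
import math
import struct
import wave


def sine_wave_bytes(freq_hz: float = 440.0, n_samples: int = 80,
                    rate: int = 8000) -> bytes:
    """Sample a sine tone and return it as the bytes of a WAVE file."""
    frames = bytearray()
    for n in range(n_samples):
        # Sampling: evaluate the continuous signal at time n / rate.
        value = math.sin(2 * math.pi * freq_hz * n / rate)
        # Quantization: map [-1.0, 1.0] onto a signed 16-bit integer.
        frames += struct.pack("<h", int(value * 32767))

    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav.setframerate(rate)
        wav.writeframes(bytes(frames))
    return buffer.getvalue()


data = sine_wave_bytes()
print(data[:4], data[8:12], len(data))
```

A higher sampling rate or more bits per sample gives better fidelity at the cost of storage, which is one reason audio differs so much from text in volume and processing requirements.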

TEXTUAL IMAGES
• Images that contain mainly typed or typeset text
• Obtained by scanning the documents
• Used for archival purposes
• Saved as images, but with further compression
• Textual and non-textual parts are stored and compressed separately; when needed, they can be combined and displayed together

GRAPHICS AND VIRTUAL REALITY
• 3-dimensional graphics are found on the Web
• CGM (Computer Graphics Metafile) standard
  – Metafile = collection of elements
  – The CGM standard specifies which elements are allowed to occur in which positions in a metafile
• VRML (Virtual Reality Modeling Language)
  – File format for describing interactive 3D objects and worlds
  – Universal interchange format for 3D graphics and multimedia
  – Can be used for various applications

MULTIMEDIA DOCUMENTS MARKUP
• HyTime = Hypermedia/Time-based Structuring Language – a standard defined for the markup of multimedia documents
• An SGML architecture that specifies the generic hypermedia structure of documents