CS 430 INFO 430 Information Retrieval Lecture 19

CS 430 / INFO 430 Information Retrieval, Lecture 19: Metadata 1

Course Administration

Descriptive Metadata
Some methods of information retrieval search and browse descriptive metadata about the objects. Descriptive metadata typically consists of a catalog or indexing record, or an abstract, one record for each object. The record acts as a surrogate for the object.
• Usually the metadata is stored separately from the object that it describes, but sometimes it is embedded in the object.
• Usually the metadata is a set of text fields. Textual metadata can be used to describe non-textual objects, e.g., software, images, music.

Documents and Surrogates
Document:
  The sea is calm to-night.
  The tide is full, the moon lies fair
  Upon the straits; -- on the French coast the light
  Gleams and is gone; the cliffs of England stand,
  Glimmering and vast, out in the tranquil bay.
  Come to the window, sweet is the night-air!
  Only, from the long line of spray
  Where the sea meets the moon-blanch'd land,
  Listen! you hear the grating roar
  Of pebbles which the waves draw back, and fling,
  At their return, up the high strand,
  Begin, and cease, and then again begin,
  With tremulous cadence slow, and bring
  The eternal note of sadness in.
Surrogate (catalog record):
  Author: Matthew Arnold
  Title: Dover Beach
  Genre: Poem
  Date: 1851
Notes:
1. The surrogate is also a document
2. Every word is different!
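
The point that the document and its surrogate share no words can be checked directly. The following is a minimal sketch, assuming a deliberately simple tokenizer (the `words` helper is hypothetical) and using only the opening lines of the poem:

```python
import re

# Opening lines of the document (the poem) and the catalog record from the slide.
document = (
    "The sea is calm to-night. The tide is full, the moon lies fair "
    "Upon the straits; on the French coast the light "
    "Gleams and is gone; the cliffs of England stand, "
    "Glimmering and vast, out in the tranquil bay."
)
surrogate = {
    "Author": "Matthew Arnold",
    "Title": "Dover Beach",
    "Genre": "Poem",
    "Date": "1851",
}


def words(text):
    """Lower-case word tokens; a simple illustrative tokenizer, not a real analyzer."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


document_words = words(document)
surrogate_words = words(" ".join(surrogate.values()))

# Terms that appear in both the full text and the catalog record.
shared = document_words & surrogate_words

# A query on the title matches the surrogate but not the full text.
query = words("Dover Beach")
matches_document = query <= document_words
matches_surrogate = query <= surrogate_words

print(shared, matches_document, matches_surrogate)
```

A user who searches for "Dover Beach" finds the poem only through its catalog record, which is why the surrogate has to be indexed as a document in its own right.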

Surrogates for Non-textual Materials
Text-based methods of information retrieval can search a surrogate for a photograph.
[Figure: a document (photograph) shown next to its surrogate (catalog record)]
See the next page for a textual catalog record about a non-textual item (a photograph).

Library of Congress catalog record (part)
CREATED/PUBLISHED: [between 1925 and 1930?]
SUMMARY: U.S. President Calvin Coolidge sits at a desk and signs a photograph, probably in Denver, Colorado. A group of unidentified men look on.
NOTES: Title supplied by cataloger. Source: Morey Engle.
SUBJECTS:
  Coolidge, Calvin,--1872-1933.
  Presidents--United States--1920-1930.
  Autographing--Colorado--Denver--1920-1930.
  Denver (Colo.)--1920-1930.
  Photographic prints.
MEDIUM: 1 photoprint ; 21 x 26 cm. (8 x 10 in.)

Categories of Descriptive Metadata
Catalog: metadata records that have a consistent structure, organized according to systematic rules. (Example: Library of Congress Catalog)
Abstract: a free text record that summarizes a longer document.
Indexing record: less formal than a catalog record, but more structured than a simple abstract. (Example: PubMed)

Metadata Format
A metadata format is a set of rules that describe the content and format of a set of metadata records, e.g.:
• AACR (Anglo-American Cataloguing Rules) / MARC
• Dublin Core
• FGDC (Federal Geographic Data Committee's Content Standard for Digital Geospatial Metadata)
• IEEE Standard for Learning Object Metadata

Uses of Metadata in Information Retrieval
Metadata is used in information retrieval systems in conjunction with or instead of full text indexing:
• For physical objects, e.g., books
• For non-textual materials, e.g., pictures, maps, datasets
• For specialized areas where high recall is important (e.g., medicine), or where features such as intended audience are hard to extract from the text (e.g., education)
• When people are ignorant of the power of full text indexing (which is surprisingly common)

Uses of Metadata in Information Retrieval
Descriptive metadata provides capabilities that are not possible with full text indexing:
• Allows fielded searching: author = "Goethe"
• Suitable for non-textual material: type = "picture" and subject = "Ithaca"
• Can be used with controlled vocabulary: language = "en" (English)
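
The three queries above can be sketched as a fielded search over a handful of records. The records and the `fielded_search` helper are hypothetical, invented only to illustrate the idea; a real catalog would use an index rather than a linear scan.

```python
# Hypothetical metadata records: one dictionary of text fields per object.
records = [
    {"id": 1, "author": "Goethe", "type": "text", "subject": "Faust", "language": "de"},
    {"id": 2, "author": "Unknown", "type": "picture", "subject": "Ithaca", "language": "en"},
    {"id": 3, "author": "Unknown", "type": "picture", "subject": "Denver", "language": "en"},
    {"id": 4, "author": "Unknown", "type": "text", "subject": "Ithaca", "language": "en"},
]


def fielded_search(records, **criteria):
    """Return the ids of records whose named fields all equal the requested values."""
    return [
        record["id"]
        for record in records
        if all(record.get(field) == value for field, value in criteria.items())
    ]


print(fielded_search(records, author="Goethe"))
print(fielded_search(records, type="picture", subject="Ithaca"))
print(fielded_search(records, language="en"))
```

The second query succeeds on a picture because the match is made against the text fields of the record, not against the object itself.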

Information Retrieval with High Recall
Full-text Indexing (automated)
• Text only. Most effective on medium-length documents on related topics. High recall requires tuning the system to the specific collection and skilled users.
Catalogs and Indexes (created manually)
• Can be used for all formats of material
• Requires close quality control of metadata creation
• High recall requires tuning the system to the specific collection and skilled users.

Using Metadata for Information Retrieval
The basic operation of information retrieval is to match the way that a user describes an information requirement (a query) against the way that items are described (an index).
The success of conventional catalogs (e.g., MARC + Anglo-American Cataloguing Rules) or indexing services (e.g., Medline) comes from the combination of:
• precise language to describe items
• trained and experienced users to formulate queries.

Library Catalogs
Examples:
Cornell University Library catalog: http://catalog.library.cornell.edu/
Library of Congress, Prints and Photographs: http://www.loc.gov/rr/print/catalog.html

Origins of Library Catalogs
Bibliographic Objective:
• To bring together like items
• To differentiate among similar ones
Sir Anthony Panizzi, Keeper of Books at the British Museum (1856-67). His Ninety-One Rules (1841) were the basis of modern catalog rules.

Origins of Library Catalogs
Information Discovery:
• to enable a person to find a book of which either the author, title or subject is known
• to show what the library has by a given author, on a given subject, or in a given kind of literature
• to assist in the choice of a book as to its edition (bibliographically) or to its character (literary or topical).
Charles Ammi Cutter, Librarian of the Boston Athenaeum, Rules for a Dictionary Catalog, 1874

Origins of Library Catalogs
Classification:
• Division of subject matter into a hierarchy.
• Typically used in libraries to provide a subject-based order for shelving books.
Melvil Dewey, Acting Librarian of Amherst College (1874). The Dewey Decimal system of book classification uses the numbers 000 to 999 to cover the general fields of knowledge and decimals to fit special subjects.
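
The hierarchy is visible in the digits of a class number: the hundreds digit gives the broadest class, each further digit narrows it, and decimals narrow it again. A minimal sketch, where `dewey_path` is a hypothetical helper and the class names are deliberately left out:

```python
def dewey_path(class_number):
    """Return the chain of broader classes for a Dewey class number, broadest first."""
    whole, _, decimals = class_number.partition(".")
    whole = whole.zfill(3)
    path = [
        whole[0] + "00",  # hundreds digit: one of the general fields 000 to 900
        whole[:2] + "0",  # tens digit: a narrower division of that field
        whole,            # units digit: a still narrower section
    ]
    # Each decimal digit fits a more special subject within the section.
    for i in range(1, len(decimals) + 1):
        path.append(whole + "." + decimals[:i])
    # Drop repeats, e.g. "500" is its own hundreds, tens and units class.
    unique = []
    for entry in path:
        if entry not in unique:
            unique.append(entry)
    return unique


print(dewey_path("027.7"))
```

Shelving books in class-number order therefore places a book next to others on progressively closer subjects.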

Library Catalogs: Technology Changes over the Years
Materials to be catalogued:
• Originally books
• Extended to serials, maps, music, etc., but concepts still rely heavily on experience with books
Form of catalog:
• Entries in books (Panizzi)
• Index cards (Cutter)
• Online databases (Kilgour)

Shared Cataloguing: OCLC
Large centralized transaction processing database system (http://www.oclc.org/)
When a library catalogs a book, it deposits a MARC record in OCLC. Other libraries can copy the record:
• saves duplication of cataloguing
• OCLC has a database of holdings from all libraries
The OCLC database has 69 million records and serves 42,000 libraries.
When developed by Fred Kilgour in 1967, OCLC was a pioneering computer system (it had to develop its own network, computer terminal, etc.)

Catalogs as Investments
Costs:
• Conventional catalog records are created by skilled librarians (cost estimate: $100 per record).
• OCLC's catalog has 69 million records. Total investment is several billion dollars.
Cataloguing Standards:
• Enable libraries to share records
• Combine records of the past with records created today
• Allow readers and librarians to move between libraries
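
The "several billion dollars" figure follows from the two numbers on the slide. A back-of-the-envelope check, assuming every record cost the estimated $100:

```python
records_in_oclc = 69_000_000   # records in the OCLC catalog (from the slide)
cost_per_record_usd = 100      # estimated cost of one conventional catalog record

total_usd = records_in_oclc * cost_per_record_usd
print(f"${total_usd / 1e9:.1f} billion")
```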

Layers of a Library Catalog
Encoding
• Rules that define how catalog records are encoded in a computer system, e.g., XML mark-up.
Syntax
• Rules that define the fields and subfields, whether repeated, optional, etc.
Semantics
• Rules that define the values of the fields and subfields, with instructions for cataloguers on what data to include and how to decide when choices have to be made.
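
The syntax layer can be pictured as a table of rules that a record is checked against, independent of how the record is encoded or what the values mean. The sketch below is illustrative only: the rule table and the `check_syntax` helper are hypothetical, and the required/repeatable flags are invented for the example rather than taken from any cataloguing standard.

```python
# Hypothetical syntax rules: tag -> (required?, repeatable?).
SYNTAX_RULES = {
    "245": (True, False),   # title statement
    "260": (False, False),  # publication information
    "650": (False, True),   # subject heading
}


def check_syntax(fields):
    """Check a list of (tag, value) pairs against SYNTAX_RULES; return the problems found."""
    problems = []
    counts = {}
    for tag, _value in fields:
        counts[tag] = counts.get(tag, 0) + 1
    for tag, count in counts.items():
        if tag not in SYNTAX_RULES:
            problems.append(f"{tag}: unknown field")
        elif count > 1 and not SYNTAX_RULES[tag][1]:
            problems.append(f"{tag}: field may not be repeated")
    for tag, (required, _repeatable) in SYNTAX_RULES.items():
        if required and tag not in counts:
            problems.append(f"{tag}: required field is missing")
    return problems


good = [
    ("245", "Campus strategies for libraries and electronic information"),
    ("650", "Academic libraries--United States--Automation."),
    ("650", "Information technology--United States."),
]
bad = [("260", "Digital Press"), ("260", "Digital Press")]

print(check_syntax(good))
print(check_syntax(bad))
```

Nothing in this check looks at what the title or subject heading says; deciding that is the job of the semantic layer.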

Library Cataloging using the Anglo-American Cataloguing Rules (AACR2)
• Rules for each category of material, e.g., monographs (books). Specify what fields should be used and what data to include in each field. Text strings were originally intended for printed catalog cards.
MARC format
• An exchange format for catalog records. Includes encoding rules and syntax specification.
"MARC Catalog"
• Catalog in MARC format, where the content of each field follows AACR2.

Anglo-American Cataloguing Rules
The Anglo-American Cataloguing Rules (AACR) provide detailed rules for:
• the choice of fields
• the content of the data that goes into each field
• the syntax of the data that goes into each field
The rules are an excellent example of technical writing: precise but clear. For an example, see: http://www.cs.cornell.edu/Courses/cs430/2006fa/slides/AACR.pdf

Name authority files
An Authority File "brings together like items and differentiates among similar ones."
• Caroline R. Arms or Caroline Ruth Arms?
• Which William Phillips of Cardiff?
• Mark Twain or Samuel Clemens?
• Epithets: of Cardiff, doctor
• Dates: 1832-1876, flourished 1860, circa 1832-1876

Name authority: example
LC Control Number: n 87870182
HEADING: Arms, Caroline R. (Caroline Ruth)
000  00907cz 2200205n 450
001  4383796
005  19890706143144.8
008  70909 n|acannaab |a aaa c
010  __ |a n 87870182
035  __ |a (DLC)n 87870182
040  __ |a InU |c DLC |d DLC
100  10 |a Arms, Caroline R. |q (Caroline Ruth)
400  10 |w nna |a Arms, Caroline Ruth
400  10 |a Arms, C. R. |q (Caroline Ruth)
670  __ |a Arms, W. Y. Report on the performance problems of the RLIN computer system, 1982: |b t.p. (Caroline R. Arms)
670  __ |a LC data base, 8/24/87 |b (hdg.: Arms, Caroline Ruth; usage: Caroline R. Arms, C. R. Arms)
670  __ |a Campus networking strategies, 1988: |b CIP t.p. (Caroline Arms)
670  __ |a Phone call to pub., 2/10/88 |b (Caroline Ruth Arms; studied at Oxford)
670  __ |a Campus strategies for libraries and electronic information, c1990: |b CIP t.p. (Caroline Arms) data sheet (b. 10-24-45)
953  __ |a bz46 |b bd24
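
One way to see what the authority record does is to treat field 100 as the authorized heading and the 400 fields as variant forms that all resolve to it. This is a minimal sketch under that assumption; the `normalize` and `build_lookup` helpers are hypothetical, and the matching is far cruder than a real authority system's.

```python
# Authorized heading (field 100) and variant forms (fields 400) from the record above.
AUTHORITY = {
    "Arms, Caroline R. (Caroline Ruth)": [
        "Arms, Caroline Ruth",
        "Arms, C. R. (Caroline Ruth)",
    ],
}


def normalize(name):
    """Fold case and collapse whitespace so trivially different spellings compare equal."""
    return " ".join(name.lower().split())


def build_lookup(authority):
    """Map every known form of a name to its single authorized heading."""
    lookup = {}
    for heading, variants in authority.items():
        for form in [heading, *variants]:
            lookup[normalize(form)] = heading
    return lookup


lookup = build_lookup(AUTHORITY)
print(lookup.get(normalize("arms,  caroline ruth")))
print(lookup.get(normalize("Arms, William Y.")))
```

Every variant brings the user to the same heading ("brings together like items"), while a different person with the same surname is not matched ("differentiates among similar ones").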

Subject information
Library of Congress Subject Headings:
  Academic libraries--United States--Automation
Hierarchical classification:
  Library of Congress call number: Z675.U5 C16
  Dewey Decimal Classification: 027.7
Creation and maintenance of lists of subject headings and classifications is a never-ending task.

MARC Format
The MARC format was developed in the late 1960s as a tagging scheme for exchanging catalog records on magnetic tape. It remains the standard way to represent such data.
At present, MARC is slowly being converted to modern computing formats, e.g., Unicode, XML.

MARC: Monograph catalog record
Citation:
Caroline R. Arms, editor, Campus strategies for libraries and electronic information. Bedford, MA: Digital Press, 1990.

MARC fields
tag  value
001  89-16879 r93
050  Z675.U5C16 1990
082  027.7/0973 20
245  Campus strategies for libraries and electronic information/Caroline Arms, editor. (title statement)
260  {Bedford, Mass.} : Digital Press, c1990. (publisher)
300  xi, 404 p. : ill. ; 24 cm. (collation)
440  EDUCOM strategies series on information technology (series title)
504  Includes bibliographical references (p. {373}-381).
020  ISBN 1-55558-036-X : $34.95

MARC fields (continued)
650  Academic libraries--United States--Automation. (subject heading)
650  Libraries and electronic publishing--United States.
650  Library information networks--United States.
650  Information technology--United States.
700  Arms, Caroline R. (Caroline Ruth)
040  DLC DLC
043  n-us--
955  CIP ver. br02 to SL 02-26-90
985  APIF/MIG
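
Because a tag such as 650 can occur several times, a record is more naturally held as a list of (tag, value) pairs than as a simple tag-to-value mapping. A minimal sketch using a few of the fields above; the variable names are hypothetical:

```python
from collections import defaultdict

# A few fields of the record, kept in order; 650 appears four times.
fields = [
    ("245", "Campus strategies for libraries and electronic information/Caroline Arms, editor."),
    ("650", "Academic libraries--United States--Automation."),
    ("650", "Libraries and electronic publishing--United States."),
    ("650", "Library information networks--United States."),
    ("650", "Information technology--United States."),
    ("700", "Arms, Caroline R. (Caroline Ruth)"),
]

# Group values by tag so that repeated fields are all retained.
by_tag = defaultdict(list)
for tag, value in fields:
    by_tag[tag].append(value)

subjects = by_tag["650"]

# Subject headings use "--" to separate the main heading from its subdivisions.
subdivided = [heading.rstrip(".").split("--") for heading in subjects]

print(len(subjects))
print(subdivided[0])
```

Splitting on "--" recovers the structure of a heading, but only because the cataloguer followed the punctuation rules; the field itself is still a text string designed for display.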

MARC Encoding
tag: 260
subfield a: {Bedford, Mass.} :
subfield b: Digital Press,
subfield c: c1990.
Note that the content is designed to be part of a printed catalog record and is not in a convenient format for computer manipulation.
MARC encoding:
&2600#abc#{Bedford, Mass.} : #Digital Press, #c1990.%
[Definitely not a modern encoding!]
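
To show why this layout is awkward for programs, here is a sketch of a parser for the string exactly as printed on the slide. It assumes, without confirmation from the slide, that "&" starts the field, the first three characters are the tag, any remaining characters before the first "#" are indicators, the next segment lists the subfield codes, the following "#"-separated segments are the subfield values in the same order, and "%" ends the field. The slide's notation is a simplified illustration, so this is not a parser for real MARC exchange files, and `parse_slide_field` is a hypothetical name.

```python
def parse_slide_field(encoded):
    """Parse one field written in the slide's illustrative notation (assumed layout)."""
    body = encoded.strip()
    if not (body.startswith("&") and body.endswith("%")):
        raise ValueError("expected a field of the form &...%")
    head, codes, *values = body[1:-1].split("#")
    if len(codes) != len(values):
        raise ValueError("number of subfield codes and values differ")
    return {
        "tag": head[:3],         # assumed: first three characters are the tag
        "indicators": head[3:],  # assumed: the rest of the head holds indicators
        "subfields": [(code, value.strip()) for code, value in zip(codes, values)],
    }


parsed = parse_slide_field("&2600#abc#{Bedford, Mass.} : #Digital Press, #c1990.%")
print(parsed["tag"], parsed["subfields"])
```

Even after parsing, subfield a still ends in " :" and subfield b in ",": punctuation meant for a printed card that a program has to strip before it can use the values.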

Modernizing MARC
1. Keep the content of the catalog record
2. Convert to Unicode for representing scripts
3. Convert to XML for tagging cataloguing metadata.
MARCXML (MARC 21 XML): http://www.loc.gov/standards/marcxml/ [Direct conversion to XML tagging]
Metadata Object Description Schema (MODS): http://www.loc.gov/standards/mods/ [Subset of MARC with data clean-up]

MARC XML
• Simple and flexible MARC XML schema: the schema retains the semantics of MARC. Fields are treated as elements with the tag as an attribute and indicators treated as attributes. Subfields are treated as subelements with the subfield code as an attribute.
• Lossless conversion of MARC to XML
• Roundtripability from XML back to MARC
• Data presentation by writing an XML stylesheet
• Validation of MARC data
• Extensibility
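
The mapping described in the first bullet can be sketched with Python's standard `xml.etree.ElementTree` module, using field 260 from the earlier slide. The `datafield` and `subfield` element names and the namespace follow the MARCXML schema; the helper function is hypothetical, and the blank indicator values are an assumption, since the slides do not give them.

```python
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"  # MARCXML namespace
ET.register_namespace("marc", MARC_NS)


def datafield(tag, ind1, ind2, subfields):
    """Build one field: tag and indicators become attributes, subfields become subelements."""
    field = ET.Element(f"{{{MARC_NS}}}datafield", {"tag": tag, "ind1": ind1, "ind2": ind2})
    for code, value in subfields:
        sub = ET.SubElement(field, f"{{{MARC_NS}}}subfield", {"code": code})
        sub.text = value
    return field


# Indicator values are assumed to be blank for this illustration.
field = datafield(
    "260", " ", " ",
    [("a", "{Bedford, Mass.} :"), ("b", "Digital Press,"), ("c", "c1990.")],
)
xml_text = ET.tostring(field, encoding="unicode")
print(xml_text)

# Reading the XML back recovers the same tag, codes and values (round trip).
parsed = ET.fromstring(xml_text)
print(parsed.get("tag"), [(s.get("code"), s.text) for s in parsed])
```

The content of each subfield is carried over unchanged, which is what makes the conversion lossless, and also why the card-style punctuation survives into the XML.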

MODS Example (extracts)
<mods>
  <titleInfo>
    <title>Sound and fury :</title>
    <subTitle>the making of the punditocracy /</subTitle>
  </titleInfo>
  <name type="personal">
    <namePart>Alterman, Eric</namePart>
    <role>
      <roleTerm type="text">creator</roleTerm>
    </role>
  </name>

MODS Example (extracts)
  <typeOfResource>text</typeOfResource>
  <originInfo>
    <place>
      <placeTerm type="text">Ithaca, N.Y</placeTerm>
    </place>
    <publisher>Cornell University Press</publisher>
    <dateIssued>c1999</dateIssued>
  </originInfo>
  <language>
    <languageTerm authority="iso639-2b" type="code">eng</languageTerm>
  </language>
</mods>
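
Because MODS uses named elements rather than numeric tags, fields can be pulled out with ordinary XML tools. A minimal sketch with Python's standard `xml.etree.ElementTree`, using a cut-down copy of the extracts; like the slides, it omits the XML namespace that a complete MODS record would declare, which is an assumption made to keep the example short.

```python
import xml.etree.ElementTree as ET

# Cut-down copy of the slide extracts, without the namespace of a full MODS record.
mods_xml = """
<mods>
  <titleInfo>
    <title>Sound and fury :</title>
    <subTitle>the making of the punditocracy /</subTitle>
  </titleInfo>
  <name type="personal">
    <namePart>Alterman, Eric</namePart>
  </name>
  <originInfo>
    <publisher>Cornell University Press</publisher>
    <dateIssued>c1999</dateIssued>
  </originInfo>
  <language>
    <languageTerm authority="iso639-2b" type="code">eng</languageTerm>
  </language>
</mods>
"""

root = ET.fromstring(mods_xml)

# Pull selected fields out by element path.
record = {
    "title": root.findtext("titleInfo/title"),
    "subtitle": root.findtext("titleInfo/subTitle"),
    "author": root.findtext("name[@type='personal']/namePart"),
    "publisher": root.findtext("originInfo/publisher"),
    "date": root.findtext("originInfo/dateIssued"),
    "language": root.findtext("language/languageTerm"),
}
print(record)
```

The extracted values still carry MARC-style punctuation ("Sound and fury :", "c1999"), so the data clean-up mentioned earlier remains a separate step.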

Notes on MARC
A great achievement:
• Developed in the 1960s
• Magnetic tape exchange format for printing catalog records
• The dawn of computing: mixed upper and lower case, variable length fields, repeated fields, non-Roman scripts
• 100(?) million records with standard content and format
• Thousands of trained librarians (millions?)

Notes on MARC
A great problem:
• Not designed for computer algorithms
• One record per item (poor links between records)
• Tied to traditional materials and traditional practices
• Not Unicode
• 100 million records at $100 each: $10 billion
A classic legacy system!