Lecture 18: Metadata & Controlled Vocabulary Introduction
SIMS 202: Information Organization and Retrieval
Prof. Ray Larson & Prof. Marc Davis
UC Berkeley SIMS
Tuesday and Thursday 10:30 am - 12:00 pm, Fall 2004
IS 202 - FALL 2004 (2004.10.28)
Lecture Contents
• Review
  – Lexical Relations
  – WordNet
• Organization of Information
• Metadata
• Dublin Core
• Controlled Vocabularies
• Discussion
Syntax
• The syntax of a language is to be understood as a set of rules which accounts for the distribution of word forms throughout the sentences of a language
• These rules codify permissible combinations of classes of word forms
Semantics
• Semantics is the study of linguistic meaning
• Two standard approaches to lexical semantics (cf. sentential semantics and logical semantics):
  – (1) compositional
  – (2) relational
Pragmatics
• Deals with the relation between signs or linguistic expressions and their users
• Deixis (literally “pointing out”)
  – E.g., “I’ll be back in an hour” depends upon the time of the utterance
• Conversational implicature
  – A: “Can you tell me the time?”
  – B: “Well, the milkman has come.” [I don’t know exactly, but perhaps you can deduce it from some extra information I give you.]
• Presupposition
  – “Are you still such a bad driver?”
• Speech acts
  – Constatives vs. performatives
  – E.g., “I second the motion.”
• Conversational structure
  – E.g., turn-taking rules
Lexical Relations
• Conceptual relations link concepts
  – Goal of Artificial Intelligence
• Lexical relations link words
  – Goal of Linguistics
Major Lexical Relations
• Synonymy
• Polysemy
• Metonymy
• Hyponymy/Hypernymy
• Meronymy/Holonymy
• Antonymy
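One of these relations, hyponymy/hypernymy, can be pictured as a chain of "is a kind of" links. The sketch below is a toy illustration only: the five-word lexicon is invented for this example and is not taken from WordNet, and the function names are hypothetical.

```python
# Toy lexicon, invented for illustration (not WordNet data).
# Each key is a hyponym; its value is the next broader term (hypernym).
HYPERNYM = {
    "spaniel": "dog",
    "dog": "canine",
    "canine": "mammal",
    "mammal": "animal",
}


def hypernym_chain(word):
    """Follow hypernym links upward until no broader term is recorded."""
    chain = []
    while word in HYPERNYM:
        word = HYPERNYM[word]
        chain.append(word)
    return chain


def is_hyponym_of(word, ancestor):
    """True if `ancestor` appears anywhere above `word` in the hierarchy."""
    return ancestor in hypernym_chain(word)
```

The relation is directional: "spaniel" is a hyponym of "animal", but not the other way around.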
WordNet
• Started in 1985 by George Miller, students, and colleagues at the Cognitive Science Laboratory, Princeton University
  – Miller is also known as the author of the paper “The Magical Number Seven, Plus or Minus Two: Some Limits on our Capacity for Processing Information” (1956)
• Can be downloaded for free:
  – www.cogsci.princeton.edu/~wn/
Structure of WordNet
[Three figure slides; images not reproduced]
Organization of Information
• Is there a basic human need to put things into some sort of order?
  – Much of natural language concerns categories of things rather than individual things
  – Why do we organize things and information?
    • Why do spoons go in THAT drawer in the kitchen and not in a can in the garage?
    • Why do your favorite books go on one shelf and not-so-favorite on another?
Why Organize Information?
• The main reason
  – So that you can find things more effectively
    • I.e., effective retrieval is predicated on some sort of organization applied to information resources
• Historically there have been many institutions and tools devoted to information organization
  – Libraries
  – Museums
  – Archives
  – Indexes and catalogs, dictionaries, phone books, etc.
Why Organize Information?
• A question of scale
  – Using your own ad hoc set of categories and methods to organize your own collection of books or CDs seems to work fine…
  – What if your collection grew to
    • 10 times the size? How would you organize it?
    • 100 times? 100,000 times?
What is Information Organization?
• Identifying the existence of all types of information-bearing entities as they are made available
• Identifying the works contained within those information-bearing entities or as parts of them
• Systematically pulling together these information-bearing entities into collections in libraries, archives, museums, Internet communications files and other such depositories
From Hagler via Taylor, Chap. 1
What is Information Organization?
• Producing lists of these information-bearing entities prepared according to standard rules for citation
• Providing name, title, subject and other useful access to these information-bearing entities
• Providing the means of locating each information-bearing entity or a copy of it
Key Issues in This Course
• How to describe information resources or information-bearing objects in ways so that they may be effectively used by those who need to use them
  – Organizing
• How to find the appropriate information resources or information-bearing objects for someone’s (or your own) needs
  – Retrieving
Key Issues
[Diagram of the information life cycle; image not reproduced. Labels: Creation, Authoring, Modifying, Using, Creating, Retention/Mining, Organizing, Indexing, Accessing, Filtering, Storing, Retrieval, Discard, Utilization, Disposition, Distribution, Networking, Searching; activity levels Active, Semi-Active, Inactive]
Organizing/Indexing
• Collecting and integrating information
• Affects data, information and metadata
• “Metadata” describes data and information
  – More on this shortly
• Organizing information
  – Types of organization?
• Indexing
Accessing/Filtering
• Using the organization created in the O/I stage to
  – Select desired (or relevant) information
  – Locate that information
  – Retrieve the information from its storage location (often via a network)
Structure of an IR System
[Diagram of an information storage and retrieval system; image not reproduced. Labels:]
• Interest profiles & queries; Formulating query in terms of descriptors; Storage of profiles; Store 1: Profiles/Search requests
• Storage line: Documents & data; Indexing (descriptive and subject); Storage of documents; Store 2: Document representations
• Rules of the game = Rules for subject indexing + Thesaurus (which consists of Lead-In Vocabulary and Indexing Language)
• Comparison/Matching; Potentially relevant documents
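The comparison/matching step in the diagram can be sketched in a few lines. This is a minimal illustration, assuming documents are represented only by sets of assigned descriptors and that a query matches when every one of its descriptors was assigned; the document ids, descriptors, and function names are invented for the example.

```python
from collections import defaultdict


def build_index(doc_descriptors):
    """Store 2: map each descriptor to the ids of documents indexed with it."""
    index = defaultdict(set)
    for doc_id, descriptors in doc_descriptors.items():
        for descriptor in descriptors:
            index[descriptor.lower()].add(doc_id)
    return index


def match(index, query_descriptors):
    """Comparison/matching: documents that carry every query descriptor."""
    sets = [index.get(d.lower(), set()) for d in query_descriptors]
    return set.intersection(*sets) if sets else set()


# Hypothetical document representations produced by indexing.
docs = {
    "doc1": ["Cataloging", "Classification"],
    "doc2": ["Cataloging", "Metadata"],
    "doc3": ["Thesauri"],
}
index = build_index(docs)
potentially_relevant = match(index, ["cataloging", "metadata"])
```

Real systems rank partial matches rather than requiring all descriptors; strict intersection is used here only to keep the sketch short.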
Metadata
• Metadata is
  – “Data about Data” (database systems)
  – Information about Information
• First used (to the best we can discover) in 1978 (meta-data)
• Used for databases in (Meta-Data Base)
  – “a data base which itself contains the structural and semantic data of other data bases”
    » Thomas R. Cousins & Wayne D. Dominick, “The Management of Data Bases,” ASIS Proceedings, 1978.
Metadata
• Structures and languages for the description of information resources and their elements (components or features)
• “Metadata is information on the organization of the data, the various data domains, and the relationship between them” (Baeza-Yates p. 142)
Metadata
• Often two main types of metadata are distinguished
  – Descriptive metadata
    • Describes the information/data object and its properties
    • May use a variety of descriptive formats and rules
  – Topical metadata
    • Describes the topic or “aboutness” of an information/data object
    • May include a variety of vocabularies for describing subjects, topics, categories, etc.
Types of Metadata
• Element names
• Element description
• Element representation
• Element coding
• Element semantics
• Element classification
Metadata Systems and Standards
• Naming and ID systems
• Bibliographic description
  – Texts
• Music
• Images and objects
• Numeric data
• Geospatial data
• Collections
• Video and motion pictures
The Same Item in Different Metadata Systems
• ISBD
• RFC 1807
• TEI Header
• MARC Record
• Dublin Core (a bit later)
ISBD Punctuation
• Title Proper (GMD) = Parallel title : other title info / First statement of responsibility ; others. -- Edition information. -- Material. -- Place of Publication : Publisher Name, Date. -- Material designation and extent ; Dimensions of item. -- (Title of Series / Statement of responsibility). -- Notes. -- Standard numbers: terms of availability (qualifications).
Bibliographic Record
• Introduction to cataloging and classification / Bohdan S. Wynar. -- 8th ed. / Arlene G. Taylor. -- Englewood, Colo. : Libraries Unlimited, 1992. -- (Library science text series).
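The prescribed punctuation means a display string like the one above can be assembled mechanically from separate fields. The following is a rough sketch covering only the areas that appear in this example (title/responsibility, edition, publication, series); the function name and parameters are hypothetical, and full ISBD has many more areas and rules.

```python
def isbd_description(title, responsibility, edition, place, publisher, date,
                     series=None):
    """Join a few ISBD areas with the '. -- ' area separator."""
    areas = [
        f"{title} / {responsibility}",     # title and statement of responsibility
        edition,                           # edition area
        f"{place} : {publisher}, {date}",  # publication area
    ]
    if series:
        areas.append(f"({series})")        # series area is parenthesized
    # Drop a trailing period on each area so the separator does not double it.
    return ". -- ".join(area.rstrip(".") for area in areas) + "."


record = isbd_description(
    title="Introduction to cataloging and classification",
    responsibility="Bohdan S. Wynar",
    edition="8th ed. / Arlene G. Taylor",
    place="Englewood, Colo.",
    publisher="Libraries Unlimited",
    date="1992",
    series="Library science text series",
)
```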
RFC 1807
BIB-VERSION:: CS-TR-v2.1
ID:: UCB//123456
ENTRY:: September 9, 1997
TYPE:: BOOK
TITLE:: Introduction to cataloging and classification
AUTHOR:: Wynar, Bohdan S.
AUTHOR:: Taylor, Arlene G.
DATE:: 1992
PAGES:: 633
COPYRIGHT:: Libraries Unlimited, 1992
SERIES:: Library Science Text Series
END:: UCB//123456
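Because each line of the record is a field name, a double colon, and a value, it is straightforward to read into a program. A minimal sketch follows, assuming one field per line (it ignores the continuation lines that RFC 1807 permits); `parse_rfc1807` is a name made up for this example.

```python
def parse_rfc1807(text):
    """Parse 'FIELD:: value' lines; repeated fields (e.g. AUTHOR) become lists."""
    record = {}
    for line in text.splitlines():
        if "::" not in line:
            continue  # skip blank or malformed lines
        field, value = line.split("::", 1)
        record.setdefault(field.strip(), []).append(value.strip())
    return record


SAMPLE = """BIB-VERSION:: CS-TR-v2.1
ID:: UCB//123456
TITLE:: Introduction to cataloging and classification
AUTHOR:: Wynar, Bohdan S.
AUTHOR:: Taylor, Arlene G.
DATE:: 1992
END:: UCB//123456"""

record = parse_rfc1807(SAMPLE)
```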
Minimal TEI Header
<teiHeader>
  <fileDesc>
    <titleStmt>
      <title>Introduction to cataloging and classification</title>
      <respStmt>
        <name>Bohdan S. Wynar</name>
        <resp>8th edition by</resp>
        <name>Arlene G. Taylor</name>
      </respStmt>
    </titleStmt>
    <publicationStmt>
      <distributor>Libraries Unlimited</distributor>
    </publicationStmt>
    <sourceDesc>
      <bibl>Introduction to cataloging and classification / Bohdan S. Wynar. -- 8th ed. / Arlene G. Taylor. -- Englewood, Colo. : Libraries Unlimited, 1992.</bibl>
    </sourceDesc>
  </fileDesc>
</teiHeader>
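Since the TEI header is ordinary XML, standard tools can pull fields out of it. The sketch below uses Python's `xml.etree.ElementTree` on an abbreviated copy of the header above (the `sourceDesc` element is omitted for brevity); the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

TEI_HEADER = """<teiHeader>
  <fileDesc>
    <titleStmt>
      <title>Introduction to cataloging and classification</title>
      <respStmt>
        <name>Bohdan S. Wynar</name>
        <resp>8th edition by</resp>
        <name>Arlene G. Taylor</name>
      </respStmt>
    </titleStmt>
    <publicationStmt>
      <distributor>Libraries Unlimited</distributor>
    </publicationStmt>
  </fileDesc>
</teiHeader>"""

root = ET.fromstring(TEI_HEADER)  # root element is <teiHeader>

# Paths are relative to the root element.
title = root.findtext("fileDesc/titleStmt/title").strip()
names = [name.text for name in root.iter("name")]  # document order
distributor = root.findtext("fileDesc/publicationStmt/distributor")
```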
MARC Record (Display)
ID: DCLC9124851-B  RTYP: c  ST: p  FRN:  MS: c  EL:  AD: 06-20-91
CC: 9110  BLT: am  DCF: a  CSC:  MOD:  SNR:  ATC:  UD: 04-11-92
CP: cou  L: eng  INT:  GPC:  BIO:  FIC: 0  CON: b
PC: s  PD: 1992/  REP:  CPI: 0  FSI: 0  ILC: a  II: 1
MMD:  OR:  POL:  DM:  RR:  COL:  EML:  GEN:  BSE:
010    9124851
020    0872878112 (cloth)
020    0872879674 (paper)
040    DLC$cDLC$dDLC
050 00 Z693$b.W94 1991
082 00 025.3$220
100 1  Wynar, Bohdan S.
245 10 Introduction to cataloging and classification /$cBohdan S. Wynar.
250    8th ed. /$bArlene G. Taylor.
260    Englewood, Colo. :$bLibraries Unlimited,$c1992.
300    xvii, 633 p. :$bill. ;$c24 cm.
440  0 Library science text series
504    Includes bibliographical references (p. 591-599) and index.
650  0 Cataloging.
650  0 Subject cataloging.
650  0 Classification$xBooks.
630 00 Anglo-American cataloguing rules.
700 10 Taylor, Arlene G.,$d1941-
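In this display, each variable field's data is divided into subfields by a `$` followed by a one-character code, with the text before the first `$` being subfield a. A small sketch of splitting one field's data on that convention is shown below; it works on the display form used on this slide, not on binary MARC records, and `parse_subfields` is a hypothetical helper.

```python
def parse_subfields(data):
    """Split MARC display data on '$'; text before the first '$' is subfield a."""
    parts = data.split("$")
    subfields = []
    if parts[0].strip():
        subfields.append(("a", parts[0].strip()))
    for part in parts[1:]:
        if part:  # first character is the subfield code, the rest is its value
            subfields.append((part[0], part[1:].strip()))
    return subfields


# Data portion of the 260 (publication) field from the record above.
publication = parse_subfields("Englewood, Colo. :$bLibraries Unlimited,$c1992.")
```

Note that the ISBD punctuation (" :", ",") travels inside the subfield values, which is why the MARC record and the ISBD display of the same item line up so closely.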
Dublin Core
• Simple metadata for describing internet resources
• For “Document-Like Objects”
• 15 Elements (in base DC)
Dublin Core
TITLE: Introduction to cataloging and classification
CREATOR: Taylor, Arlene G.
OTHER CONTRIBUTOR: Wynar, Bohdan S.
DATE: 1992
FORMAT: BOOK
LANGUAGE: ENG
PAGES: 633
PUBLISHER: Libraries Unlimited
SUBJECT: Cataloging.
SUBJECT: Subject cataloging.
SUBJECT: Classification -- Books
DESCRIPTION: Textbook on cataloging and classification
RESOURCE TYPE: text.monograph
RESOURCE IDENTIFIER: (ISBN) 0872879674
Dublin Core Elements
• Title
• Creator
• Subject
• Description
• Publisher
• Other Contributors
• Date
• Resource Type
• Format
• Resource Identifier
• Source
• Language
• Relation
• Coverage
• Rights Management
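One common way to attach Dublin Core to a web page is as HTML `meta` tags whose names carry a `DC.` prefix. The sketch below renders a small record that way; the dictionary layout and the function name `dc_meta_tags` are assumptions made for this example, and only a handful of the 15 elements are shown.

```python
from html import escape


def dc_meta_tags(record):
    """Render a dict of element -> list of values as HTML <meta> tags."""
    tags = []
    for element, values in record.items():
        for value in values:  # elements such as subject may repeat
            tags.append(
                f'<meta name="DC.{element}" content="{escape(value)}">'
            )
    return "\n".join(tags)


record = {
    "title": ["Introduction to cataloging and classification"],
    "creator": ["Taylor, Arlene G."],
    "subject": ["Cataloging", "Classification -- Books"],
    "date": ["1992"],
}
html_head = dc_meta_tags(record)
```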
Mega-Metadata Standards
• METS - Metadata Encoding and Transmission Standard (http://www.loc.gov/standards/mets)
  – Developed by the Digital Library Federation as an implementation strategy for preservation metadata
  – “XML document format for encoding metadata necessary for both management of digital library objects within a repository and exchange of such objects between repositories (or between repositories and their users)”
  – Provides a flexible mechanism for encoding descriptive, administrative, and structural metadata for a digital library object, and for expressing the complex links between these various forms of metadata
Metadata Resources
• Check the Links section from the class home page
• Best site is the “Digital Library: Metadata Resources” page from IFLA at http://www.ifla.org/II/metadata.htm
• For another good source of information on metadata standards see http://www.chin.gc.ca/English/Standards
Controlled Vocabularies
• Vocabulary control is the attempt to provide a standardized and consistent set of terms (such as subject headings, names, classifications, etc.) with the intent of aiding the searcher in finding information
• That is, it is an attempt to provide a consistent set of descriptions for use in (or as) metadata
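In practice, vocabulary control often comes down to mapping the many terms people actually use onto one preferred term. A minimal sketch of that mapping follows; the six-entry vocabulary and the function names are invented for illustration and do not reproduce any published subject heading list.

```python
# Hypothetical lead-in vocabulary: entry term (lowercase) -> preferred term.
LEAD_IN = {
    "cars": "Automobiles",
    "autos": "Automobiles",
    "automobiles": "Automobiles",
    "movies": "Motion pictures",
    "films": "Motion pictures",
    "motion pictures": "Motion pictures",
}


def preferred_term(term):
    """Return the preferred term, or None if the term is outside the vocabulary."""
    normalized = " ".join(term.lower().split())  # fold case and extra spaces
    return LEAD_IN.get(normalized)


def index_terms(free_terms):
    """Map free-text terms to a sorted list of distinct preferred terms."""
    mapped = (preferred_term(t) for t in free_terms)
    return sorted({p for p in mapped if p is not None})
```

Indexers and searchers who both go through the same mapping end up on the same term, which is the consistency the slide describes.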
Controlled Vocabularies
• Names and name authorities
• Gazetteers (geographic names)
• Code lists (e.g., LC language codes)
• Subject heading lists
• Classification schemes
• Thesauri
Control of Names
• Cutter’s (1876) objectives of bibliographic description
  – To enable a person to find a document of which
    • The author, or
    • The title, or
    • The subject is known
  – To show what a library has
    • By a given author
    • On a given subject (and related subjects)
    • In a given kind (or form) of literature
• First serves access
• Second serves collocation
Problems with Names
• How many names should be associated with a document?
• Which of these should be the “main entry”?
• What form should each of the names take?
• What references should be made from other possible forms of names that haven’t been used?
The Problem
• Proliferation of the forms of names
  – Different names for the same person
  – Different people with the same names
• Examples
  – From Books in Print (semi-controlled but not consistent)
  – ERIC author index (not controlled)
Goethe …etc…
John Muir
Pauline Cochrane née Atherton
Rules for Description
• AACR II and other sets of descriptive cataloging rules provide guidelines for:
  – Determining the number of name entries
  – Choosing a main entry
  – Deciding on the form of name to be used
  – Deciding when to make references
Authority Control
• Authority control is concerned with the creation and maintenance of a set of terms that have been chosen as the standard representatives (also known as established) based on some set of rules
• If you have rules, why do you need to keep track of all of the headings? Can’t you just infer the headings from the rules?
Conditions of Authorship?
• Single person or single corporate entity
• Unknown or anonymous authors
  – Fictitiously ascribed works
• Shared responsibility
• Collections or editorially assembled works
• Works of mixed responsibility (e.g., translations)
• Related works
Choice of Name
• AACR II says that the predominant form of the name used in a particular author’s writings should be chosen as the form of name
• References should be made from the other forms of the name
Form of the Name
• When names appear in multiple forms, one form needs to be chosen
• Criteria for choice are:
  – Fullness (e.g., full names vs. initials only)
  – Language of the name
  – Spelling (choose predominant form)
• Entry element:
  – John Smith or Smith, John?
  – Mao Zedong or Zedong, Mao? (Mao Tse Tung?)
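The entry-element question shows why the form of a name cannot be derived by a purely mechanical rule. The sketch below applies the naive "last word is the surname" rule; `invert_name` is a hypothetical helper written for this example.

```python
def invert_name(name):
    """Naive rule: treat the last word as the entry element (surname)."""
    parts = name.split()
    if len(parts) < 2:
        return name  # single-word names are left as they are
    return f"{parts[-1]}, {' '.join(parts[:-1])}"


western = invert_name("John Smith")  # "Smith, John"
chinese = invert_name("Mao Zedong")  # "Zedong, Mao"
```

The rule gives the expected heading for "John Smith" but produces "Zedong, Mao" for a name whose family name already comes first, which is the kind of case cataloging rules and authority files exist to handle.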
Name Authority Files
ID: NAFL8057230  ST: p  EL: n  STH: a  MS: c  UIP: a  TD: 19910821174242
KRC: a  NMU: a  CRC: c  UPN: a  SBU: a  SBC: a  DID: n  DF: 05-14-80
RFE: a  CSC:  SRU: b  SRT: n  SRN: n  TSS:  TGA: ?  ROM: ?  MOD:
VST: d 08-21-91  Other Versions: earlier
040    DLC$cDLC$dOCoLC
053    PR6005.R517
100 10 Creasey, John
400 10 Cooke, M. E.
400 10 Cooke, Margaret,$d1908-1973
400 10 Cooper, Henry St. John,$d1908-1973
400 00 Credo,$d1908-1973
400 10 Fecamps, Elise
400 10 Gill, Patrick,$d1908-1973
400 10 Hope, Brian,$d1908-1973
400 10 Hughes, Colin,$d1908-1973
400 10 Marsden, James
400 10 Matheson, Rodney
400 10 Ranger, Ken
400 20 St. John, Henry,$d1908-1973
400 10 Wilde, Jimmy
500 10 $wnnnc$aAshe, Gordon,$d1908-1973
Different names for the same person
Name Authority Files
ID: NAFO9114111  ST: p  EL: n  STH: a  MS: n  UIP: a  TD: 19910817053048
KRC: a  NMU: a  CRC: c  UPN: a  SBU: a  SBC: a  DID: n  DF: 06-03-91
RFE: a  CSC: c  SRU: b  SRT: n  SRN: n  TSS:  TGA: ?  ROM: ?  MOD:
VST: d 08-19-91
040    OCoLC$cOCoLC
100 10 Marric, J. J.,$d1908-1973
500 10 $wnnnc$aCreasey, John
663    Works by this author are entered under the name used in the item. For a listing of other names used by this author, search also under$bCreasey, John
670    OCLC 13441825: His Gideon's day, 1955$b(hdg.: Creasey, John; usage: J. J. Marric)
670    LC data base, 6/10/91$b(hdg.: Creasey, John; usage: J. J. Marric)
670    Pseuds. and nicknames dict., c1987$b(Creasey, John, 1908-1973; British author; pseud.: Marric, J. J.)
Name Authority Files
ID: NAFL8166762  ST: p  EL: n  STH: a  MS: c  UIP: a  TD: 19910604053124
KRC: a  NMU: a  CRC: c  UPN: a  SBU: a  SBC: a  DID: n  DF: 08-20-81
RFE: a  CSC:  SRU: b  SRT: n  SRN: n  TSS:  TGA: ?  ROM: ?  MOD:
VST: d 06-06-91  Other Versions: earlier
040    DLC$cDLC$dOCoLC
100 10 Butler, William Vivian,$d1927-
400 10 Butler, W. V.$q(William Vivian),$d1927-
400 10 Marric, J. J.,$d1927-
670    His The durable desperadoes, 1973.
670    His The young detective's handbook, c1981:$bt.p. (W. V. Butler)
670    His Gideon's way, 1986:$bCIP t.p. (William Vivian Butler writing as J. J. Marric)
Different people writing with the same name
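The 100, 400, and 500 fields in these records behave like a lookup table: the 100 field is the established heading, 400 fields are "see" references from unused forms, and 500 fields are "see also" references to related headings. A sketch of that lookup is given below, using a few of the headings from the three records; the dictionary layout and function names are assumptions made for this example, not the structure of any real authority system.

```python
# A subset of headings and references transcribed from the records above.
RECORDS = {
    "Creasey, John": {
        "see_from": ["Cooke, M. E.", "Marsden, James", "Wilde, Jimmy"],
        "see_also": ["Ashe, Gordon, 1908-1973"],
    },
    "Marric, J. J., 1908-1973": {
        "see_from": [],
        "see_also": ["Creasey, John"],
    },
    "Butler, William Vivian, 1927-": {
        "see_from": ["Butler, W. V. (William Vivian), 1927-",
                     "Marric, J. J., 1927-"],
        "see_also": [],
    },
}


def build_lookup(records):
    """Map every established heading and every variant to its heading."""
    lookup = {}
    for heading, refs in records.items():
        lookup[heading] = heading
        for variant in refs["see_from"]:
            lookup[variant] = heading
    return lookup


LOOKUP = build_lookup(RECORDS)


def established_heading(name):
    """Return the established heading for a name form, or None if unknown."""
    return LOOKUP.get(name)
```

The dates are what keep the two people who wrote as J. J. Marric apart: one form is an established heading in its own right, the other is only a reference to Butler.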
The Haunting of Lauran Paine
1. Paine, Lauran. ALSO KNOWN AS:
Carrel, Mark. Thompson, Russ. Andrews, A. A. Benton, Will. Bradford, Will. Bradley, Concho. Brennan, Will. Carter, Nevada. Allen, Clay. Almonte, Rosa. Armour, John. Cassady, Claude. Glendenning, Donn. Kelley, Ray. Kilgore, John. Martin, Tom. Slaughter, Jim. Standish, Buck. …
Batchelor, Reg. Beck, Harry. Bedford, Kenneth. Bosworth, Frank. Bovee, Ruth. Cassidy, Claude. Custer, Clint. Dana, Amber. Dana, Richard. Davis, Audrey. Drexler, J. F. Duchesne, Antoinette. Fisher, Margot. Fleck, Betty. Frost, Joni. Gordon, Angela. Gorman, Beth. Hayden, Jay. Houston, Will. Howard, Troy. Ingersol, Jared. …
Kelly, Ray. Ketchum, Jack. Liggett, Hunter. Lucas, J. K. Lyon, Buck. Morgan, Arlene. Morgan, Valerie. O'Connor, Clint. St. George, Arthur. Sharp, Helen. Thorn, Barbara. Archer, Dennis. Clark, Badger.
Some Interesting Ones…
Structure of an IR System
[Diagram of an information storage and retrieval system; image not reproduced. Labels:]
• Search line: Interest profiles & queries; Formulating query in terms of descriptors; Storage of profiles; Store 1: Profiles/Search requests
• Storage line: Documents & data; Indexing (descriptive and subject); Storage of documents; Store 2: Document representations
• Rules of the game = Rules for subject indexing + Thesaurus (which consists of Lead-In Vocabulary and Indexing Language)
• Comparison/Matching; Potentially relevant documents
Adapted from Soergel, p. 19
Uses of Controlled Vocabularies
• Library subject headings, classification, and authority files
• Commercial journal indexing services and databases
• Yahoo, and other web classification schemes
• Online and manual systems within organizations
  – SunSolve
  – MacArthur
Types of Indexing Languages
• Uncontrolled keyword indexing
• Indexing languages
  – Controlled, but not structured
• Thesauri
  – Controlled and structured
• Classification systems
  – Controlled, structured, and coded
• Faceted thesauri and classification systems
• Much more on these topics later…
Discussion
Next Time
• Introduction to the Phone Project
• Readings/discussion
  – Information Architecture (Rosenfeld)