Synonyms & Taxonomies: Thesaurus Design for Information Architects (ACIA)

Synonyms & Taxonomies Thesaurus Design for Information Architects ACIA Seminar by Peter Morville & Samantha Bailey an

Introductions
Peter Morville (morville@argus-inc.com)
• CEO, Argus Associates
• Co-author, Information Architecture for the World Wide Web
• Director, ACIA
• LIS background
• Fortune 500 consulting

Introductions
Samantha Bailey (bailey@argus-inc.com)
• VP of Operations, Argus Associates
• LIS background
• Fortune 500 consulting
• VC experience

Seminar Outline
I. Thesauri in Context
II. Value of Thesauri
III. Methodology
IV. Metadata
V. Vocabulary Control
VI. Structure & Relationships
VII. Thesaurus Management
VIII. Case Study
IX. Related Topics
Instructional Methods
• Exercises, Quizzes, Discussions, Breaks

Our Approach
Assumptions
• Understanding of IA Basics
• Interest in Thesauri and the Web
Philosophy
• Reality is Important
• Technology has Limitations
• Success takes Time
• Tension can be Healthy

Thesauri in Context: What is IA?
The art and science of structuring and organizing information systems to help people achieve their goals.

Thesauri in Context: An Ecological Approach
Books: Information Ecologies by Bonnie Nardi and Information Ecology by Thomas Davenport

Thesauri in Context: IA From Top to Bottom
Top-Down: portal, strategy, hierarchy, primary path
Bottom-Up: sub-site, objects, metadata, multiple paths
[Diagram: a portal above local subsites (HR, Engineering, R&D…); Object X with metadata fields Name, Product Category, Topic, Stale Date, Author, Security]

Thesauri in Context: Where Does IA Fit?
The Elements of User Experience, Jesse James Garrett
http://www.jjg.net/ia/elements.pdf

Thesauri in Context: What is Vocabulary Control?
Controlled Vocabulary: A list of preferred and variant terms. A subset of natural language.

Preferred     Variants                          Authority
AZ            Ariz, Arizona, 85XXX              US Postal Service
IBM           Intl Bus Machines, Big Blue       NY Stock Exchange
Nyctalopia    Night blindness, Moon blindness   National Library of Medicine
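A controlled vocabulary of this kind can be sketched as a lookup from every variant to its preferred term. The sketch below is illustrative only: the names `VOCABULARY`, `build_lookup`, and `normalize` are hypothetical, the matching is a simple case-insensitive comparison, and the ZIP-code pattern (85XXX) from the table is left out because it would need pattern matching rather than a plain lookup.

```python
# Hypothetical entry vocabulary taken from the slide's table:
# preferred term -> list of variant terms.
VOCABULARY = {
    "AZ": ["Ariz", "Arizona"],
    "IBM": ["Intl Bus Machines", "Big Blue"],
    "Nyctalopia": ["Night blindness", "Moon blindness"],
}


def build_lookup(vocabulary):
    """Map every preferred and variant term (case-folded) to its preferred term."""
    lookup = {}
    for preferred, variants in vocabulary.items():
        for term in [preferred, *variants]:
            lookup[term.casefold()] = preferred
    return lookup


LOOKUP = build_lookup(VOCABULARY)


def normalize(term):
    """Return the preferred term for a user-supplied term, or None if unknown."""
    return LOOKUP.get(term.strip().casefold())
```

Calling `normalize("big blue")` returns the preferred term `IBM`, while a term outside the vocabulary returns `None`, which a real system might route to full-text search or to a thesaurus manager for review.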

Thesauri in Context: Why Control Vocabulary?
Language is Ambiguous
• Synonyms, homonyms, antonyms, contronyms, etc.
In the Oxford English Dictionary:
• “Round” takes 7½ pages or 15,000 words to define.
• “Set” has 58 uses as a noun, 126 as a verb, 10 as an adjective.
The Mother Tongue: English & How It Got That Way by Bill Bryson

Thesauri in Context: Why Control Vocabulary?
So Your Users Don’t Have To!

Thesauri in Context: Semantic Relationships
Types
1. Equivalence
2. Hierarchical
3. Associative
[Diagram: preferred term Vermont]
• 1 (Variant): Vt, Green Mountain State
• 2 (Broader): United States; (Narrower): Burlington
• 3 (Related): Skiing, Maple Syrup

Thesauri in Context: Levels of Control

Thesauri in Context: What is a Thesaurus?
Traditional Use
• Dictionary of synonyms (Roget’s)
• From one word to many words
Information Retrieval Context
• A controlled vocabulary in which equivalence, hierarchical, and associative relationships are identified for purposes of improved retrieval
• Many words to one concept

Thesauri in Context: Terminology
Preferred Terms (UF subject headings, descriptors)
• SN: Scope Notes
• UF: Used For
• BT: Broader Term
• NT: Narrower Term
• RT: Related Terms (“See Also”)
Variant Terms (UF non-preferred, entry terms)
• USE (“See”)

Thesauri in Context: Types of Thesauri

Thesauri in Context: Visibility
Classic Use
• Both indexers and searchers explicitly map natural language terms onto controlled vocabularies
Web Environment
• Able to choose level of visibility (implicit use, thesaural browsers)
• Opportunity to educate users (terminology, associative learning)

Thesauri in Context: Niche Applications (hypothetical example)

Thesauri in Context: Thesaurus Standards
Mono-Lingual Thesauri
• ISO 2788 (1974, 1985, 1986, International)
• BS 5723 (1987, British)
• AFNOR NFZ 47-100 (1981, French)
• DIN 1463 (1987-1993, German)
• ANSI/NISO Z39.19 (1994, United States)
Multi-Lingual Thesauri
• ISO 5964 (1985, International)

Thesauri in Context: ANSI/NISO Standard
Z39.19-1993: Guidelines for the Construction, Format, and Management of Monolingual Thesauri. 84 pp. ISBN: 1-880124-04-1. Price: $49.00.
http://www.niso.org/stantech.html
Reasons to Follow Standard
• Significant thinking behind guidelines
• Technology integration
• Cross-database compatibility

Thesauri in Context: Oracle’s Perspective
“The phrase…thesaurus standard is somewhat misleading. The computing industry considers a ‘standard’ to be a specification of behavior or interface. These standards do not specify anything. If you are looking for a thesaurus function interface, or a standard thesaurus file format, you won't find it here. Instead, these are guidelines for thesaurus compilers -- compiler being an actual human, not a program. What Oracle has done is taken the ideas in these guidelines and in ANSI Z39.19…and used them as the basis for a specification of our own creation…So, Oracle supports ISO-2788 relationships or ISO-2788 compliant thesauri.”

Thesauri in Context: A World in Transition
“The majority of basic problems of thesaurus construction had already been solved by 1967.” (Krooks and Lancaster, 1993)

Traditional Thesauri     Web Thesauri
Print                    Online
Academic / Library       Business
Expert / Repeat Users    Novice / Infrequent Users
Visible                  Invisible
Accepted Value           Unknown Value

Section Break: I. Thesauri in Context, II. Value of Thesauri, III. Methodology, IV. Metadata, V. Vocabulary Control, VI. Structure & Relationships, VII. Thesaurus Management, VIII. Case Study, IX. Related Topics

Value of Thesauri: IA Metrics
• Cost of finding (time, clicks, frustration, precision).
• Cost of not finding (success, recall, frustration, alternatives).
• Cost of development (time, budget, staff, frustration).
• Value of learning (related products, services, projects, people).

Value of Thesauri: KM Metrics
• Revenue Generation (% revenues spent on KM, new revenue generation)
• Opportunity Cost (staff time, customers lost)
• Knowledge Efficiency (faster product development, # mistakes made twice)
• Data Quality (% knowledge on intranet, % email with attachments)
• Intranet Usage (# hits, # contributions)
• Individual Behavior (# citations)
• Technical Performance (uptime, search response time)
Working Council for Chief Information Officers, Basic Principles of Information Architecture (http://www.cio.executiveboard.com)

Value of Thesauri: Web Site Statistics
• Wasted expense: most sites will waste between $1.5M and $2.1M on redesigns next year.
• Forfeited revenue: poorly architected retailing sites are underselling by as much as 50%.
• Lost customers: the sites we tested are driving away up to 40% of repeat traffic.
• Eroded brand: people who have a bad experience typically tell 10 others.
Forrester Research, Why Most Web Sites Fail (Sept 98)

Value of Thesauri: Intranet Statistics
Employees spend 35% of productive time searching for information online.
Working Council for Chief Information Officers, Basic Principles of Information Architecture (http://www.cio.executiveboard.com)
Managers spend 17% of their time (6 weeks a year) searching for information.
Information Ecology, Thomas Davenport and Lawrence Prusak (http://argus-acia.com/content/review001.html)

Value of Thesauri: Intranet Statistics
Sun Microsystems’ usability experts calculated that 21,000 employees were wasting an average of six minutes per day due to inconsistent intranet navigation structures. When lost time was multiplied by staff salaries, the estimated productivity loss exceeded $10 million per year.
Jakob Nielsen, Web Design and Development, September 1997
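The arithmetic behind the Sun figure can be checked with a back-of-the-envelope calculation. Only the employee count, the six minutes per day, and the $10 million total come from the slide; the 250 working days per year is an assumption added here, so the implied hourly cost is a rough illustration rather than a reported number.

```python
# Figures stated on the slide.
EMPLOYEES = 21_000
MINUTES_LOST_PER_DAY = 6
REPORTED_ANNUAL_LOSS = 10_000_000  # dollars; the slide says the loss "exceeded" this

# Assumption (not from the slide): number of working days in a year.
WORKING_DAYS_PER_YEAR = 250

# Total staff hours lost across the company each day and each year.
hours_lost_per_day = EMPLOYEES * MINUTES_LOST_PER_DAY / 60
hours_lost_per_year = hours_lost_per_day * WORKING_DAYS_PER_YEAR

# Hourly staff cost that would produce the reported loss under the assumption above.
implied_hourly_cost = REPORTED_ANNUAL_LOSS / hours_lost_per_year

print(f"{hours_lost_per_day:,.0f} hours/day, {hours_lost_per_year:,.0f} hours/year")
print(f"implied cost per staff hour: ${implied_hourly_cost:.2f}")
```

Under that assumption, six minutes per person adds up to 2,100 staff hours a day, and the $10 million figure corresponds to roughly $19 per staff hour.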

Value of Thesauri: Intranet Statistics
After spending two years and $3 million on development and usability testing, Bay Networks expects to see $10 million in productivity gains and a 10 percent cycle-time reduction for new product development as a result of its new information architecture.
Working Council for Chief Information Officers, Basic Principles of Information Architecture (http://www.cio.executiveboard.com)

Value of Thesauri: Intranet Statistics
• 40% of corporate users can’t find the information they need on their intranet.
• Prior to intranet reengineering in 1997, Ford conducted a survey of its 100,000+ user base. Employees stated they could only find 15% of the information they needed to do their jobs.
• Under-investment in (unstructured) information. 80% spending on 20% (structured) data.
Working Council for Chief Information Officers, Basic Principles of Information Architecture (http://www.cio.executiveboard.com)

Value of Thesauri: Searching Problems
“Most of the complaints we get are due to the way users search – they use the wrong keywords.” - a manufacturing company
“We have problems with the way customers enter queries. Capitalizations and misspellings give us headaches.” - a software company
Forrester Research, Must Search Stink? (June 2000)

Value of Thesauri: Searching Statistics
“Search will become the centerpiece of navigation.”
• 90% of firms rate search as very or extremely important.
• 52% don’t measure search effectiveness.
Forrester Research, Must Search Stink? (June 2000)

Value of Thesauri: CV Statistics
Researchers at Bell Labs found the probability that two people would choose the same word to describe an object to be less than 20%.
Furnas, Landauer, et al., Bell Labs (1987)
30% of corporations systematically utilize metadata to classify information, while only one to three percent of companies populate those metadata tags using controlled vocabularies. 71% don’t account for misspellings or synonyms.
Forrester Research, Building an Intranet Portal (Jan 1999)

Value of Thesauri: CV Statistics
Principle of unlimited aliasing: by leveraging synonyms, recall went from 20% to 80% (in a small collection).
The Trouble with Computers; research study at Bellcore (Furnas et al. 1987)
“The findings indicate that a hypertext index with multiple access points for each concept…led to greater effectiveness and efficiency of retrieval on almost all measures.”
A Usability Assessment of Online Indexing Structures, by Carol A. Hert, Elin K. Jacob, and Patrick Dawson, Journal of the American Society for Information Science (September 2000)

Value of Thesauri: Complementary Approaches
Basic
• Navigation Design (Browsing)
• Full Text Indexing (Searching)
Advanced
• Collaborative Filtering
• Lexical Databases
• Automated Hierarchy-Generation

Value of Thesauri: Navigation Design
Relationships
• Global & Local (hierarchical)
• Contextual (associative)

Value of Thesauri: Full Text Indexing
Strengths
• Enables high precision (exact phrase)
• Enables high recall (word occurrence)
Weaknesses
• Often results in low precision (“aboutness”)
• Often results in low recall (synonyms)
Complementary Use
• Provide users with option (search CV, full text)
• Intelligent next step (no hits on CV > full text)
• Full text search within CV search zones
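The "intelligent next step" idea (no hits on the controlled vocabulary, then fall back to full text) can be sketched as follows. The index structures, document IDs, and the `search` function are all hypothetical; a real search engine would expose its own API for this.

```python
# Hypothetical data: an entry vocabulary (variant -> preferred term),
# a controlled-vocabulary index, and a full-text word index.
ENTRY_VOCAB = {"fruit fly": "Drosophila", "drosophila": "Drosophila"}
CV_INDEX = {"Drosophila": {"doc1", "doc3"}}
FULLTEXT_INDEX = {
    "fruit": {"doc1", "doc2"},
    "fly": {"doc1", "doc4"},
    "genome": {"doc5"},
}


def search(query):
    """Try the controlled vocabulary first; fall back to full text if it has no hits."""
    preferred = ENTRY_VOCAB.get(query.strip().casefold())
    if preferred and CV_INDEX.get(preferred):
        return "cv", sorted(CV_INDEX[preferred])

    # Fallback: documents containing every word of the query.
    words = query.casefold().split()
    postings = [FULLTEXT_INDEX.get(word, set()) for word in words]
    docs = set.intersection(*postings) if postings else set()
    return "fulltext", sorted(docs)
```

In this sketch `search("Fruit Fly")` is answered from the controlled vocabulary, so it also returns a document indexed under Drosophila that never mentions the words "fruit fly", while `search("genome")` silently drops through to the full-text index.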

Value of Thesauri: Collaborative Filtering
SN: Approaches that leverage knowledge about preferences or behaviors of people or organizations to facilitate information retrieval.
Popularity / Importance
• Direct Hit (analysis of searcher behavior)
• Amazon (cross-title purchasing habits)
• Google (citation indexing)
Considerations
• Favors established materials
• Lacks benefits of vocabulary control
• User-centric (ignores content, context)

Value of Thesauri: Lexical Databases
Scope Notes
• Broad term banks or semantic networks that specify lexical variants and term relationships.
• General-interest, off-the-shelf thesauri.
Examples
• Roget’s Thesaurus
• WordNet
• Plumb Design Visual Thesaurus

Value of Thesauri: Lexical Databases
• Number of Terms (General, Niche)
• Importance of Context (Bug in Software, Espionage)

                             # of Terms   # of Meanings   Notes
WordNet                      50,000       70,000
Oxford English Dictionary    615,000      2.4 M           > 20,000 New Terms Per Year
Named Insect Species         1.4 M                        Drosophila UF Fruit Fly
Square D Products            300,000                      Electrical Distribution

Value of Thesauri: Hierarchy-Generation Software
An Intimidating Vocabulary
• Multivariate regression models, probabilistic Bayesian models, neural networks, symbolic rule learning, computational semiotics, and support vector machines
General Techniques
• Clustering (similarity, word co-occurrence)
• Vector Space (extract “meaning” from terms, teach by example)

Value of Thesauri: Hierarchy-Generation Software
Examples
• Autonomy (http://www.autonomy.com/)
• Semio (http://www.semio.com/)
• Cartia (http://www.cartia.com/)
Hyperbole
• Autonomy claims their software eliminates "the need for any manual labor in the process."

Value of Thesauri: Hierarchy-Generation Software
Considerations
• No business context
• No consideration of users
• No planning for future
• Mixed category schemes
• Hidden costs: integration, rule design, training
Trends
• Niche use (e.g., news, web search results)
• Integration with manual classification schemes

Section Break: I. Thesauri in Context, II. Value of Thesauri, III. Methodology, IV. Metadata, V. Vocabulary Control, VI. Structure & Relationships, VII. Thesaurus Management, VIII. Case Study, IX. Related Topics

Methodology: Overview
[Diagram: Process, Deliverables, and Consulting across the Strategy, Design, and Build phases; a marker indicates special emphasis during a phase]

Methodology: Strategy x Process
Information Architect’s Toolbox *
• Business Context: strategy meetings, opinion leader interviews, technology assessment
• Content & Applications: content inventory, content analysis, metadata evaluation
• Users: log analysis, observation / usability testing, interviews / affinity modeling
• Existing IA: heuristic evaluation, classification scheme analysis, benchmarking
* select right mix for project; this is a partial list of tools

Methodology: Design x Deliverables
Information Architect’s Toolbox *
• Organization & Labeling: metadata specifications, controlled vocabularies, thesaurus
• Navigation (Embedded): primary taxonomy, classification schemes, blueprints and wireframes
• Navigation (Supplemental): search system, sitemap / indexes, personalization / customization
• Synthesis: design / authoring guidelines, content management policies, functional specifications
* select right mix for project; this is a partial list of tools

Methodology: Consulting x Build
Information Architect’s Toolbox *
Categories: Metadata Application, Point of Production, Post Launch
Tools: object-level support indexers indexing guides support thesaurus managers support designers / developers usability testing input / analysis fix problems metrics evaluation improvement
* select right mix for project; this is a partial list of tools

Methodology: Thesaurus Construction
Strategy
1. Define Thesaurus Strategy
2. Develop Project Plan
Design
3. Gather Candidate Terms / Variants
4. Select Preferred Terms
5. Develop Facet Hierarchies
6. Identify ‘See Also’ Links
7. Write Design / Functional Specifications
8. Build / Buy Software Applications
Build
9. Launch Indexing Operation
10. Refine Controlled Vocabularies

Methodology: Strategy Questions
• Does vocabulary control make sense?
• Where and for what purposes?
• How will it align with business goals?
• How will it support users’ goals?
• How will it impact content management?
• Will we buy, borrow, or build?

Section Break: I. Thesauri in Context, II. Value of Thesauri, III. Methodology, IV. Metadata, V. Vocabulary Control, VI. Structure & Relationships, VII. Thesaurus Management, VIII. Case Study, IX. Related Topics

Metadata: Definition
Information about information
Purposes
1. Document surrogate (abstract)
2. Provides context (date, publisher)
3. Facilitates retrieval (subject)

Metadata: Ways to Leverage
User Interface
• Generate browsable indexes (site-wide, sub-site, specialized authority files)
• Enable field-specific searching (filters, zones, sorting)
• Support personalization (map profile to tags)
Behind the Scenes
• Enable efficient content management
• Support decentralized tagging

Metadata: Types of Indexing
Full Text
• Manual: x
• Automated: complete text minus stop words
Keyword (Natural Language)
• Manual: humans assign “relevant” words and phrases
• Automated: software assigns “relevant” words and phrases
Controlled Vocabulary
• Manual: humans map variants to preferred terms
• Automated: software maps variants to preferred terms

Metadata: Full Text Indexing

Metadata: Keyword Indexing
<HTML><HEAD>
<TITLE>STARTREK.COM: The Official Star Trek Web Site!</TITLE>
<META NAME='description' CONTENT='STARTREK.COM: The Official Star Trek Web Site! The starting point for all Star Trek information on the web.'>
<META NAME='keywords' CONTENT='star trek, enterprise, james kirk, mister spock, seven of nine, doctor mccoy, captain sulu, borg, klingon, romulan, ferengi, human, starfleet command, delta quadrant, alpha quadrant, gamma quadrant, excelsior, paramount, voyager, deep space nine, captain sisko, jean luc picard, kathryn janeway, starfleet academy, united federation of planets'>
<META NAME='author' CONTENT='Paramount Digital Entertainment'>
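As an illustration of how a search tool might read keyword metadata like this, the sketch below pulls the `keywords` META tag out of a page using Python's standard `html.parser` module. The class name `MetaKeywords` and the shortened sample page are made up for the example; the slide does not say how any particular engine processed these tags.

```python
from html.parser import HTMLParser


class MetaKeywords(HTMLParser):
    """Collect the comma-separated terms from a <META NAME='keywords'> tag."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "keywords":
            content = attributes.get("content") or ""
            self.keywords = [k.strip() for k in content.split(",") if k.strip()]


# Shortened, hypothetical page in the style of the slide's example.
page = (
    "<HTML><HEAD><TITLE>Example</TITLE>"
    "<META NAME='keywords' CONTENT='star trek, enterprise, borg'>"
    "<META NAME='author' CONTENT='Example Author'>"
)

parser = MetaKeywords()
parser.feed(page)
parser.close()
print(parser.keywords)
```

The extracted list is uncontrolled natural language: nothing here maps "enterprise" or "borg" to preferred terms, which is the gap controlled-vocabulary indexing is meant to fill.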

Metadata: CV Indexing
Partners/Competitors

UI / LRID   ACCEPTED TERM   Variant Terms
PC0004      Bell Atlantic   BellAtlantic; Bell Atlantic / North; NYNEX; Nynex
PC0091      NLG             National Leisure Group
PC0076      VH1             Video Hits 1; VH-1

Metadata: Indexing Guidelines
Considerations
• Specificity: rule of specific entry
• Exhaustivity: number of terms per document
• Aboutness: strive for consistent interpretation
• Consistency: can be more important than quality
• Quality: balance against speed and consistency

Metadata: Comparative Analysis
Full Text (extraction)
• High specificity enables precision (sometimes)
• Exhaustivity allows for high recall (sometimes)
Keyword (assignment or extraction)
• Relatively low level of investment
• Selection of more relevant words / phrases may increase recall and precision (sometimes)
Controlled Vocabulary (assignment)
• Synonym management increases recall
• Disambiguation increases precision (value increases with size, Medline > 6 M documents)
• Enables hierarchical and “see also” browsing

Metadata: Cost Analysis

Metadata: Automated Indexing
Primary Benefit
• Save money (cost of manually classifying 1 journal article = $1.70)
Approaches
• Term Extraction: extraction of “important” words and phrases (proximity, stemming)
• Latent Semantic Indexing: vector space approach (extracts meaning, training required)
Desired Features
• Assign terms from controlled vocabularies
• Integrate with thesauri, database tools, etc.
• Handle multi-lingual collections

Metadata: Automated Indexing Software
Categories & Labels
• Search Engines, Data Mining, Text Extraction, Knowledge Management, Automatic Classification, Meta-Tagging
Leading Products
• Metacode’s Metatagger (http://www.metacode.com/)
• Mohomine (http://www.mohomine.com/)
• Oingo (http://www.oingo.com/)
• InXight Categorizer (http://www.inxight.com/)
• Semio Taxonomy (http://www.semio.com/)
• Inktomi / Ultraseek CCE (http://www.inktomi.com/)

Metadata: Selecting a Strategy
Factors to Consider (Manual vs. Automated)
• Cost (per document): High vs. Low
• Speed: Slow vs. Fast
• Consistency: Variable vs. High
• Quality: Variable
• Multimedia-Capable: Yes vs. No
• Intelligent (understand text and guidelines): Yes vs. No

Section Break: I. Thesauri in Context, II. Value of Thesauri, III. Methodology, IV. Metadata, V. Vocabulary Control, VI. Structure & Relationships, VII. Thesaurus Management, VIII. Case Study, IX. Related Topics

Vocabulary Control: Getting Started
Types
1. Equivalence
2. Hierarchical
3. Associative
[Diagram: preferred term Vermont]
• 1 (Variant): Vt, Green Mountain State
• 2 (Broader): United States; (Narrower): Burlington
• 3 (Related): Skiing, Maple Syrup

Vocabulary Control: Identify Terms
• Published Reference Materials: thesauri, classification schemes, encyclopedias, dictionaries, glossaries, indexes
• Content: representative sample of web site / intranet
• Users: search log analysis, surveys, interviews
• Experts: authors, subject experts

Vocabulary Control: Organize Terms
1. Define preferred terms
2. Link synonyms and variants
3. Group preferred terms by subject
4. Identify broader and narrower terms
5. Identify related terms
Note: steps 3-5 are tentative designations and part of iterative process.

Vocabulary Control: Form of Preferred Terms
• Grammatical Form (noun, adjective, verb)
• Spelling (defined authority, house style)
• Singular & Plural Form (count nouns)
• Abbreviations & Acronyms (popular use)
Considerations
• Stemming helps (but not for mouse/mice)
• Global guidelines / term-specific decisions
• Rules simplify decision-making
• Consistency enhances usability
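The mouse/mice caveat is easy to demonstrate. The toy suffix-stripper below is a deliberately naive stand-in for a real stemming algorithm (it is not Porter stemming or any product's implementation), and the `IRREGULAR` table is a hypothetical term-specific exception list of the kind the "global guidelines / term-specific decisions" bullet suggests.

```python
def naive_stem(word):
    """Toy plural stripper: handles regular '-ies' and '-s' endings only."""
    w = word.lower()
    if w.endswith("ies") and len(w) > 4:
        return w[:-3] + "y"
    if w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        return w[:-1]
    return w


# Hypothetical term-specific exceptions that a global rule cannot catch.
IRREGULAR = {"mice": "mouse"}


def normalize_form(word):
    """Apply term-specific exceptions first, then the global stemming rule."""
    w = word.lower()
    return IRREGULAR.get(w, naive_stem(w))
```

With the global rule alone, "taxonomies" and "taxonomy" collapse to the same form but "mice" and "mouse" do not; adding the single exception entry brings them together.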

Vocabulary Control: Selection of Preferred Terms
ANSI/NISO Z39.19-1993
• 3.0: “Literary warrant (occurrence of terms in documents) is the guiding principle for selection of the preferred (term).”
• 5.2.2: “Preferred terms should be selected to serve the needs of the majority of users.”

Vocabulary Control: Definition of Terms
The meaning of the term must be deliberately restricted.
• Qualifiers (manage homographs): Cells (biology) / Cells (electric)
• Scope Notes (restrict meaning): Hamburger SN: includes burgers made with beef. Otherwise use “Turkey Burger” or “Veggie Burger”
• Definition (clarify and educate): trend towards integration of glossaries

Vocabulary Control: Variant Terms
Variant terms provide the users with entry points into the vocabulary.
Synonyms (same meaning)
• cats USE felines
• helicopters USE whirlybirds
Lexical Variants (different word forms)
• paediatrics USE pediatrics
• BK USE Burger King
Quasi-Synonyms (treated as equivalent)
• generic posting: beagle USE dog
• antonyms/continuum: wetness USE dryness

Vocabulary Control: Recall and Precision
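Precision and recall have standard set-based definitions: precision is the share of retrieved documents that are relevant, and recall is the share of relevant documents that were retrieved. A minimal sketch, using made-up document IDs:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, given two collections of doc IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: five relevant documents exist, four were retrieved,
# and two of the retrieved documents are relevant.
relevant_docs = {"d1", "d2", "d3", "d4", "d5"}
retrieved_docs = {"d1", "d2", "d9", "d10"}
precision, recall = precision_recall(retrieved_docs, relevant_docs)
```

In this example precision is 2/4 and recall is 2/5. Adding synonyms to the entry vocabulary typically grows the retrieved set, which is why it tends to raise recall and can lower precision.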

Vocabulary Control: Term Specificity
Assuming a good entry vocabulary, increased term specificity allows for improved precision without hurting recall (but costs grow fast).
• Vocabulary A: United States
• Vocabulary B: United States > California > San Diego

Vocabulary Control: Compound Terms
• ANSI/NISO Z39.19: “Each descriptor…should represent a single concept.”
• ISO 2788: “It is a general rule that…compound terms should be factored (split) into simple elements.”

Vocabulary Control: Compound Terms
Article: “Software for Information Architecture”

Section Break: I. Thesauri in Context, II. Value of Thesauri, III. Methodology, IV. Metadata, V. Vocabulary Control, VI. Structure & Relationships, VII. Thesaurus Management, VIII. Case Study, IX. Related Topics

Structure & Relationships: Types
• Bottom-up (semantic, term to term)
• Top-down (shape, classification)
Semantic Relationships (reciprocity)
• Equivalence
• Hierarchical
• Associative
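Reciprocity means every relationship is recorded in both directions: BT pairs with NT, USE pairs with UF, and RT is symmetric. One way to guarantee this is to store the inverse automatically whenever a relationship is added. The `Thesaurus` class below is a hypothetical sketch of that idea, populated with the Vermont example from earlier in the deck; it is not the data model of any particular thesaurus tool.

```python
# Each relationship type and its reciprocal.
INVERSE = {"BT": "NT", "NT": "BT", "RT": "RT", "USE": "UF", "UF": "USE"}


class Thesaurus:
    """Minimal term store that records every relationship with its reciprocal."""

    def __init__(self):
        self.relations = {}  # term -> {relation type -> set of target terms}

    def add(self, source, relation, target):
        """Record source -relation-> target and the reciprocal target -> source."""
        pairs = ((source, relation, target), (target, INVERSE[relation], source))
        for s, r, t in pairs:
            self.relations.setdefault(s, {}).setdefault(r, set()).add(t)

    def get(self, term, relation):
        """Return the sorted targets of one relationship type for a term."""
        return sorted(self.relations.get(term, {}).get(relation, ()))


thesaurus = Thesaurus()
thesaurus.add("Vermont", "BT", "United States")
thesaurus.add("Vermont", "NT", "Burlington")
thesaurus.add("Vermont", "RT", "Skiing")
thesaurus.add("Vt", "USE", "Vermont")
```

After these four calls, asking for the narrower terms of "United States" or the UF entries of "Vermont" works even though neither was entered directly.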

Structure & Relationships: Semantic Relationships
[Diagram: preferred term Inhabited Places]
• (Broader): Cultural Landscapes
• (Synonym): Human Settlements
• (Variant): Settlements
• (Narrower): Ghost Towns
• (Related): Housing, Dwellings

Structure & Relationships: Semantic Relationships
Equivalence
• Use/Used For (USE/UF)
• Leads from variants to preferred
e.g., prams: USE baby carriages

Structure & Relationships: Semantic Relationships
Hierarchical
• Broader Term/Narrower Term (BT/NT)
Types
• Generic (class/species, inheritance): Vertebrata NT Amphibia
• Whole-Part (associative unless exclusive): Ear NT Vestibular Apparatus
• Instance (proper name): Seas NT Mediterranean Sea

Structure & Relationships: Semantic Relationships
Associative
• Related Term (RT, See Also)
• Non-hierarchical and non-equivalent
• Relation should be “strongly implied”
e.g., hammers RT nails

Structure & Relationships: Associative Relationships
Examples
• Field of Study and Object of Study: Forestry RT Forests
• Process and its Agent: Temperature Control RT Thermostat
• Concepts and their Properties: Poisons RT Toxicity
• Action and Product of Action: Weaving RT Cloth
• Concepts Linked by Causal Dependence: Bereavement RT Death

Structure & Relationships Classification Schemes SN: Hierarchical arrangement of terms. In navigation context, use Hierarchy. UF: Categorization; Taxonomy; Ontology. RT: Hierarchy 84

Structure & Relationships Pre- & Post-Coordination Enumerative Classification Schemes • Pre-coordinate (more compound terms) • All terms are enumerated (listed) in their entirety in the scheme. Library of Congress Classification Scheme Synthetic Classification Schemes • Post-coordinate (more uni-terms) • New terms can be created by combining terms during a search (AND). Art & Architecture Thesaurus 85
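Post-coordination, as described above, amounts to intersecting the document sets of simple terms at search time. The sketch below assumes a tiny hypothetical index; the terms and document IDs are invented for illustration.

```python
# Hypothetical postings: each uni-term maps to the IDs of documents indexed with it.
INDEX = {
    "groundwater": {1, 2},
    "soil": {3, 4},
    "pollution": {1, 3},
}


def post_coordinate(terms, index):
    """Combine single terms with AND by intersecting their document sets."""
    postings = [index.get(term, set()) for term in terms]
    if not postings:
        return set()
    return set.intersection(*postings)
```

A pre-coordinate scheme would instead list a compound heading such as "soil pollution" as its own entry, so no intersection is needed at search time.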

Structure & Relationships Pre- & Post-Coordination • In the highly enumerative LC Classification, “Groundwater -- Pollution” and “Soil pollution” are dispersed at indexing (high precision, low recall). • Keyword searching improves recall, hurts precision (a synthetic band-aid, potential false drop on “soil purification standards”). 86
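The precision and recall trade-off named above can be made concrete with a small calculation. The document IDs below are invented purely to illustrate the two measures; they are not data from the seminar.

```python
def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved; recall = hits / relevant."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical collection: four documents are truly about pollution.
RELEVANT = {1, 2, 3, 4}
HEADING_SEARCH = {1, 2}                    # one precise heading: few results, all relevant
KEYWORD_SEARCH = {1, 2, 3, 4, 5, 6, 7, 8}  # broad keyword: finds everything, plus false drops
```

In this made-up case the heading search scores higher on precision and the keyword search scores higher on recall, which is the pattern the slide describes.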

Structure & Relationships Polyhierarchy Strict Hierarchies • Each term appears in only one place in the hierarchy. • Essential for placement of physical objects. Polyhierarchies • Terms cross-listed in multiple categories. • Accepts complex nature of reality. 87

Structure & Relationships Polyhierarchy Medical Subject Headings (MeSH) • Compound terms needed to manage 6 million documents in Medline. • High level of pre-coordination forces polyhierarchy. • Terms may have more than one BT. 88

Structure & Relationships Faceted Classification Overview • Invented by S. R. Ranganathan (1930s) • Handle complex subjects (reality) • One principle of division at a time • Multiple “pure” taxonomies • UF analytico-synthetic scheme, fielded database Facets • Fundamental facets: personality, matter, energy, space, time • Common facets: subject (about), geography (in), author (by whom) Art & Architecture Thesaurus, ASIS Thesaurus 89
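The "fielded database" view of facets can be sketched with the three common facets named above. The records, titles, and function names below are hypothetical and serve only to show facets being applied independently and then combined.

```python
from collections import Counter

# Hypothetical records described along the three common facets from the slide.
RECORDS = [
    {"title": "A", "subject": "housing", "geography": "Michigan", "author": "Lee"},
    {"title": "B", "subject": "housing", "geography": "Ohio", "author": "Kim"},
    {"title": "C", "subject": "ghost towns", "geography": "Michigan", "author": "Lee"},
]


def select(records, **facets):
    """Keep records whose value matches every requested facet."""
    return [
        record
        for record in records
        if all(record.get(name) == value for name, value in facets.items())
    ]


def facet_counts(records, facet):
    """Count how many records fall under each value of one facet."""
    return Counter(record[facet] for record in records)
```

Each facet is a separate, single-principle taxonomy; the combination happens only when a user selects values from more than one of them.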

Structure & Relationships Facets, Coordination, Specificity 90

Structure & Relationships Yahoo Characteristics • Single Facet (a topical hierarchy) • Fairly Enumerative (search on “Boston” finds 45 categories including: Boston Celtics, Boston Tea Party, Anonymous Account of the Boston Massacre) • Polyhierarchical (Computer Science@ listed under Computers & Internet and Science) Observations • Huge number of categories and levels (unwieldy) • Fits user expectations (where do I find this?) 91

Structure & Relationships ASIS Thesaurus Characteristics • Faceted (16 facets including document types, fields and disciplines, organizations, qualities) • Fairly Synthetic (large percentage of one or two word single-concept descriptors) • Polyhierarchical (machine aided indexing BT computer applications, BT indexing) Observations • Faceted approach allows small number of terms to be combined in large number of unexpected ways (e.g., ambiguity and informatics) • Presentation is not accessible to typical user 92
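The polyhierarchical entry above gives one term two broader terms. A minimal sketch of how that might be stored and traversed follows; the BT entry comes from the slide, and everything else (names, layout) is an assumption.

```python
# One term with two broader terms, as in the ASIS Thesaurus example on the slide.
BT = {
    "machine aided indexing": ["computer applications", "indexing"],
}


def ancestors(term, bt):
    """Return every broader term reachable from `term`, across all parents."""
    seen = []
    queue = list(bt.get(term, []))
    while queue:
        current = queue.pop(0)
        if current not in seen:
            seen.append(current)
            queue.extend(bt.get(current, []))
    return seen


def is_polyhierarchical(bt):
    """True if any term has more than one broader term."""
    return any(len(parents) > 1 for parents in bt.values())
```

A strict hierarchy would fail the `is_polyhierarchical` check, since every term would have at most one parent.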

Structure & Relationships A Unification Theory
Taxonomy: single facet, enumerative • fits user expectations (where did they put this?) • use for top few levels (familiar gateway to site) • early user tests (best primary hierarchy) • application of human expertise
Thesaurus: faceted, synthetic • fits content complexity (how can I describe this?) • populate the hierarchy (combinations, see also) • ongoing user tests (leverage power, flexibility) • human-software hybrid (facet-specific solutions)
Hypothesis: This hybrid information architecture will become a common model for web sites and intranets over the next several years. 93

Section Break I. Thesauri in Context II. Value of Thesauri III. Methodology IV. Metadata V. Vocabulary Control VI. Structure & Relationships VII. Thesaurus Management VIII. Case Study IX. Related Topics 94

Thesaurus Management What’s Involved? • Software, workflow, quality control • Vocabularies evolve over time • Impacts authors, indexers, users Vocabulary Maintenance Tasks • Add, delete, enhance, normalize terms • Overall evaluation 95

Thesaurus Management Software: What to Look For • Traditional database functionality • Compliant with standards (ANSI, ISO) • Relationship control (reciprocity, validation, orphan identification) • Term status (proposed, provisional, accepted) • Flexible output (alphabetical, hierarchical) • Integration with related tools and tasks (indexing, searching, browsing) Willpower’s List of Thesaurus Software http://www.willpower.demon.co.uk/thessoft.htm 96
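The relationship-control checks listed above (reciprocity and orphan identification) can be sketched in a few lines. This is an illustration under stated assumptions, not how any particular product works: an "orphan" is taken here to mean a term with no BT or NT link at all, and the sample data deliberately omits one reciprocal link.

```python
def check_thesaurus(terms, bt, nt):
    """Return (reciprocity_errors, orphans) for a set of terms and their BT/NT links."""
    errors = []
    # Every BT link should have a matching NT link in the other direction.
    for term, parents in bt.items():
        for parent in parents:
            if term not in nt.get(parent, []):
                errors.append((term, "BT", parent))
    # Every NT link should have a matching BT link in the other direction.
    for term, children in nt.items():
        for child in children:
            if term not in bt.get(child, []):
                errors.append((term, "NT", child))
    # Assumption: an orphan is a term that takes part in no BT/NT link.
    linked = set()
    for links in (bt, nt):
        for term, others in links.items():
            if others:
                linked.add(term)
                linked.update(others)
    orphans = sorted(set(terms) - linked)
    return errors, orphans


# Sample data: the NT link from "Ear" is deliberately missing.
TERMS = {"Seas", "Mediterranean Sea", "Ear", "Vestibular Apparatus", "Ghost Towns"}
BT = {"Mediterranean Sea": ["Seas"], "Vestibular Apparatus": ["Ear"]}
NT = {"Seas": ["Mediterranean Sea"]}

ERRORS, ORPHANS = check_thesaurus(TERMS, BT, NT)
```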

Thesaurus Management Software: What You’ll Find Thesaurus Management Software • Standards-compliant, sophisticated • Poor integration (library-centric) • Examples: Lexico, MultiTes Database Management Software • Strong integration • Less thesaurus-specific functionality • Examples: Oracle (interMedia), Sybase (English Wizard) 97

Thesaurus Management Software: What You’ll Find Search Engines • Watch for casual use of “thesaurus” • Look for integration with browsing. Ultraseek Thesaurus Expansion for Queries: Administrators may put sets of synonyms in thesaurus.txt file… When a query matches one of the terms in that file, the synonyms will automatically appear, so the user has the option to add it to the query. Verity's core search products include the following advanced knowledge retrieval capabilities: advanced query expansion and disambiguation tools, including linguistic stemming and thesaurus expansion. 98
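The synonym-suggestion behaviour quoted above can be approximated generically. The sketch below does not reproduce Ultraseek's file format or matching rules, which the slide does not specify; the synonym sets and the whole-phrase matching are assumptions for illustration.

```python
# Hypothetical synonym sets; the pairs reuse terms that appear elsewhere in the seminar.
SYNONYM_SETS = [
    {"prams", "baby carriages"},
    {"www", "world wide web"},
]


def suggest_synonyms(query, synonym_sets):
    """Return synonyms the user could add, for any set with a term found in the query."""
    # Pad with spaces so terms only match as whole words or phrases.
    normalized = " " + " ".join(query.lower().split()) + " "
    suggestions = set()
    for group in synonym_sets:
        matched = {term for term in group if f" {term} " in normalized}
        if matched:
            suggestions |= group - matched
    return sorted(suggestions)
```

Note that this is flat synonym expansion only: there are no preferred terms, hierarchy, or related terms, which is the "casual use" of the word thesaurus the slide warns about.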

Section Break I. Thesauri in Context II. Value of Thesauri III. Methodology IV. Metadata V. Vocabulary Control VI. Structure & Relationships VII. Thesaurus Management VIII. Case Study IX. Related Topics 99

Case Study: Call Center Intranet Introduction • KM application • 6,000 users (customer care associates) • 8,000 documents (hierarchy, search) • 6 month project (10/97 to 4/98) • $500K of $10M redesign Goals • Reduce training time / time to find • Increase use / customer satisfaction 100

Case Study: Call Center Intranet Process Overview Strategy • Background, vocabulary, meetings, observation • 4 weeks x 2.5 PM + 1 IA Design • Bottom-up focus (doc types, fields, templates) • 4 weeks x 2 PM + 2 IA • 4 weeks x 1 IA (during implementation) Implementation • Indexing / develop controlled vocabularies • Specifications (authors, indexers, developers) • 16 weeks x 4 indexers + 1 IA + 2 PM + 1 subject expert 101

Case Study: Call Center Intranet Controlled Vocabularies Primary Vocabularies • Partners/Competitors (122) • Plans/Promotions (173) • Products/Services (151 / 184 variants) • Geographic Codes (51) Secondary Vocabularies • Adjustment Codes (36) • Corporate Terminology (70) • Time Codes (12) 102

Case Study: Call Center Intranet Primary Vocabularies Partners/Competitors
UI      ACCEPTED TERM   LRID  Variant Terms
PC0004  Bell Atlantic         BellAtlantic; Bell Atlantic / North; NYNEX; Nynex
PC0091  NLG                   National Leisure Group
PC0076  VH1                   Video Hits 1; VH-1
103
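A vocabulary like the one above can drive simple machine-aided indexing: scan a document for any accepted or variant form and record the accepted term. This is a hedged sketch, not the project's actual tooling; the three entries come from the slide (with identifiers written without internal spaces, an assumption about the original), while the matching rules and function name are invented for the example.

```python
import re

# The three Partners/Competitors rows from the slide.
VOCABULARY = [
    {"ui": "PC0004", "accepted": "Bell Atlantic",
     "variants": ["BellAtlantic", "Bell Atlantic / North", "NYNEX", "Nynex"]},
    {"ui": "PC0091", "accepted": "NLG", "variants": ["National Leisure Group"]},
    {"ui": "PC0076", "accepted": "VH1", "variants": ["Video Hits 1", "VH-1"]},
]


def tag_text(text, vocabulary):
    """Return (UI, accepted term) for each entry whose accepted or variant form occurs in text."""
    found = []
    for entry in vocabulary:
        for label in [entry["accepted"]] + entry["variants"]:
            # Match the label as a whole word or phrase, ignoring case.
            pattern = r"(?<!\w)" + re.escape(label) + r"(?!\w)"
            if re.search(pattern, text, flags=re.IGNORECASE):
                found.append((entry["ui"], entry["accepted"]))
                break
    return found
```

Tagging with the unique identifier rather than the label means an accepted term can later be renamed without re-indexing documents.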

Case Study: Call Center Intranet Primary Vocabularies Products/Services
UI      Accepted Term   LRID  Variant Terms
PS0135  Access Dialing        10-288; 10-322; dial around
PS0006  Air Miles             AirMiles
PS0151  XYZ Direct            USADirect; XYZ USA Direct; XYZDirect card
104

Case Study: Call Center Intranet Primary Vocabularies Geographic Codes
CT  Connecticut
DE  Delaware
DC  District of Columbia; Dist. Columbia
Note: Continental U.S. is equivalent to the lower 48 states. 105

Case Study: Call Center Intranet Secondary Vocabularies Adjustment Codes
DAK  Denies All Knowledge      -
MOS  Monthly Service Charge    Mnthly. Service Charge; Mnthly. Svc. Charge; Monthly Svc. Charge
WNO  Wrong Number              -
WTN  Working Telephone Number  Working Tele. Number
106

Case Study: Call Center Intranet Secondary Vocabularies Corporate Terminology
Billed Telephone Number (BTN)  Billed Tele. Number
Cross Boundary Account         Foreign Account
Fraud                          -
Multi Level Marketing          Multi-Level Marketing; MultiLevel Marketing; MLM
World Wide Web                 WWW; WorldWideWeb
107

Case Study: Call Center Intranet Blueprints 108

Case Study: Call Center Intranet Wireframes: Content 109

Case Study: Call Center Intranet Wireframes: Browsable Index Provides ability to view all documents tagged with same preferred term. Ability to combine fields for powerful search/browse. 110
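The browsable index described above can be sketched as one index per field, keyed by preferred term, with fields combined by intersection. The documents and field names below are hypothetical; only the term labels echo the case-study vocabularies.

```python
from collections import defaultdict

# Hypothetical tagged documents; each field holds preferred terms only.
DOCS = [
    {"id": "doc1", "product": ["Air Miles"], "geography": ["CT"]},
    {"id": "doc2", "product": ["Air Miles", "Access Dialing"], "geography": ["DE"]},
    {"id": "doc3", "product": ["Access Dialing"], "geography": ["CT"]},
]


def build_index(docs, fields):
    """Build one browsable index per field: preferred term -> list of document IDs."""
    index = {field: defaultdict(list) for field in fields}
    for doc in docs:
        for field in fields:
            for term in doc.get(field, []):
                index[field][term].append(doc["id"])
    return index


def browse(index, **selected):
    """Return IDs of documents tagged with every selected field=term pair."""
    result = None
    for field, term in selected.items():
        ids = set(index[field].get(term, []))
        result = ids if result is None else result & ids
    return sorted(result or [])


INDEX = build_index(DOCS, ["product", "geography"])
```

Selecting a single term reproduces the "view all documents tagged with same preferred term" behaviour; adding a second field narrows the list.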

Case Study: Call Center Intranet Deliverables Overview • Blueprints and Wireframes • Controlled Vocabularies • Authoring & Indexing Guidelines • Indexed Documents (4,000) • Functional Specifications • Documentation & Training 111

Section Break I. Thesauri in Context II. Value of Thesauri III. Methodology IV. Metadata V. Vocabulary Control VI. Structure & Relationships VII. Thesaurus Management VIII. Case Study IX. Related Topics 112

Related Topics Multi-Lingual Thesauri Concepts • Source / Target Language • Degrees of Equivalence • Localization, not Globalization Facts (from The Mother Tongue by Bill Bryson) • There are now more students of English in China than there are people in the United States • The French can’t distinguish house and home • Finnish has 15 case forms (noun variants) • The Eskimos have 50 words for types of snow but no word that just means snow • A blizzard in England is a flurry in Nebraska 113

Related Topics The List Goes On… Thesauri AND • Business Strategy • Content Management • Markup Languages • Notation • XML 114

Seminar Review I. Thesauri in Context II. Value of Thesauri III. Methodology IV. Metadata V. Vocabulary Control VI. Structure & Relationships VII. Thesaurus Management VIII. Case Study IX. Related Topics 115

How To Learn More Argus Center for Information Architecture Web Site http://argus-acia.com Email Newsletter Strange Connections, Events, Interviews Thesaurus Resources & Examples http://argus-acia.com/seminars/ user name and password both = “lajolla” 116

Contact Us Argus Associates, Inc. 912 North Main Street Ann Arbor, Michigan 48104 (734) 913-0010 Sales sales@argus-inc.com Employment http://argus-inc.com/recruiting/ Web Sites http://argus-inc.com/ http://argus-acia.com/ 117