CS 276A: Text Information Retrieval, Mining, and Exploitation
Lecture 9, 5 Nov 2002

Recap: Relevance Feedback
• Rocchio Algorithm: typical weights: alpha = 8, beta = 64, gamma = 64
• Tradeoff alpha vs. beta/gamma: if we have a lot of judged documents, we want a higher beta/gamma. But we usually don’t …
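
The slide's formula did not survive extraction. The sketch below assumes the standard textbook form of Rocchio, q_new = alpha * q + beta * centroid(relevant docs) - gamma * centroid(non-relevant docs), with the typical weights quoted above as defaults. The function name, the dict-based vectors, and the clipping of negative weights are illustrative assumptions, not something the slide specifies.

```python
from collections import defaultdict

def rocchio(query, relevant, nonrelevant, alpha=8.0, beta=64.0, gamma=64.0):
    """Return a reweighted query vector (dict: term -> weight).

    query       -- dict term -> weight for the original query
    relevant    -- list of document vectors judged relevant
    nonrelevant -- list of document vectors judged non-relevant
    """
    new_query = defaultdict(float)
    for term, weight in query.items():
        new_query[term] += alpha * weight
    # Add the centroid of the relevant docs, subtract the centroid of the others.
    for docs, coeff in ((relevant, beta), (nonrelevant, -gamma)):
        for doc in docs:
            for term, weight in doc.items():
                new_query[term] += coeff * weight / len(docs)
    # Assumption: terms whose weight ends up non-positive are dropped.
    return {term: w for term, w in new_query.items() if w > 0}

new_q = rocchio(
    {"jaguar": 1.0},
    relevant=[{"jaguar": 1.0, "car": 1.0}],
    nonrelevant=[{"jaguar": 1.0, "cat": 1.0}],
)
```

With only one judged document on each side, the large beta and gamma let a single judgment dominate the original query, which is the tradeoff the slide points at.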

Pseudo Feedback
[Diagram: initial query -> retrieve documents -> top k documents -> label top k docs relevant -> apply relevance feedback -> retrieve documents again]
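
A minimal sketch of the pseudo-feedback loop, assuming bag-of-words documents, a dot-product score, and a positive-only feedback step (no non-relevant set). The helper names, the scoring function, and the weights alpha = beta = 1 are hypothetical choices for illustration.

```python
from collections import Counter, defaultdict

def score(query, doc):
    """Dot product between a query vector and a bag-of-words document."""
    return sum(weight * doc.get(term, 0) for term, weight in query.items())

def retrieve(query, docs):
    """Return document indices ordered by decreasing score (stable on ties)."""
    return sorted(range(len(docs)), key=lambda i: score(query, docs[i]), reverse=True)

def pseudo_feedback(query, docs, k=2, alpha=1.0, beta=1.0):
    """Treat the top k hits as relevant, expand the query, and retrieve again."""
    top = retrieve(query, docs)[:k]
    expanded = defaultdict(float)
    for term, weight in query.items():
        expanded[term] += alpha * weight
    for i in top:  # the top k docs are *assumed* relevant, no user judgments
        for term, count in docs[i].items():
            expanded[term] += beta * count / len(top)
    expanded = dict(expanded)
    return expanded, retrieve(expanded, docs)

docs = [Counter(text.split()) for text in (
    "apple pie recipe", "apple tart recipe", "tart dessert", "car engine")]
expanded, ranking = pseudo_feedback({"apple": 1.0}, docs, k=2)
```

In this toy collection the document "tart dessert" shares no term with the original query, but it gets a positive score after expansion because "tart" occurs in one of the top-ranked documents.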

Pseudo-Feedback: Performance

Today’s topics
• User Interfaces
• Browsing
• Visualization

The User in Information Access
[Flow diagram: User's information need -> Find starting point -> Formulate/Reformulate Query -> Send to system -> Receive results -> Explore results -> Done? (no: loop back; yes: Stop)]

The User in Information Access
[Flow diagram: User's information need -> Find starting point -> Formulate/Reformulate Query -> Send to system -> Receive results -> Explore results -> Done? (no: loop back; yes: Stop). Annotation: "Focus of most IR!"]

Information Access in Context
[Flow diagram: User's High-Level Goal -> Information Access -> Analyze -> Synthesize -> Done? (no: loop back; yes: Stop)]

The User in Information Access
[Flow diagram: User's information need -> Find starting point -> Formulate/Reformulate Query -> Send to system -> Receive results -> Explore results -> Done? (no: loop back; yes: Stop)]

Starting points
• Source selection
  • Highwire Press
  • Lexis-Nexis
  • Google!
• Overviews
  • Directories/hierarchies
  • Visual maps
  • Clustering

Highwire Press Source Selection

Hierarchical browsing
[Figure labels: Level 0, Level 1, Level 2]

Visual Browsing: Themescape

Browsing
[Figure: a path from "Starting point" through intermediate items (x) to "Answer"]
Credit: William Arms, Cornell

Scatter/Gather
• Scatter/gather allows the user to find a set of documents of interest through browsing.
  • Take the collection and scatter it into n clusters.
  • Pick the clusters of interest and merge them.
  • Iterate.
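
The scatter, pick, merge, iterate loop can be sketched as follows. The clustering step here is a deliberately naive stand-in (the first n documents act as seeds and every document joins its most similar seed by Jaccard overlap); it is an assumption for illustration, not the clustering method used by the actual Scatter/Gather system. All function names are hypothetical.

```python
def jaccard(a, b):
    """Word-set overlap between two tokenized documents."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def scatter(docs, n):
    """Split docs (lists of tokens) into at most n clusters around seed docs."""
    if n < 1:
        raise ValueError("n must be at least 1")
    seeds = docs[:n]  # assumption: the first n documents serve as seeds
    clusters = [[] for _ in seeds]
    for doc in docs:
        best = max(range(len(seeds)), key=lambda i: jaccard(doc, seeds[i]))
        clusters[best].append(doc)
    return clusters

def gather(clusters, chosen):
    """Merge the clusters the user picked into a new, smaller collection."""
    return [doc for i in chosen for doc in clusters[i]]

collection = [
    ["star", "planet"], ["star", "film"],
    ["planet", "orbit"], ["film", "actor"],
]
clusters = scatter(collection, 2)      # scatter into n = 2 clusters
focused = gather(clusters, [0])        # the user picks cluster 0
reclustered = scatter(focused, 2)      # iterate on the smaller collection
```

Each iteration reclusters only the gathered documents, so the clusters become finer as the user narrows in on the documents of interest.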

Scatter/Gather

Scatter/gather

How to Label Clusters
• Show titles of typical documents
  • Titles are easy to scan
  • Authors create them for quick scanning!
  • But you can only show a few titles, which may not fully represent the cluster
• Show words/phrases prominent in cluster
  • More likely to fully represent cluster
  • Use distinguishing words/phrases
  • But harder to scan
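
One way to pick "distinguishing" words is to prefer terms that are frequent inside the cluster and rare outside it. The sketch below scores each term by the difference between its document frequency in the cluster and in the remaining documents; that scoring rule and the function name are assumptions chosen for brevity, and other measures would serve equally well.

```python
from collections import Counter

def label_cluster(cluster, others, top=3):
    """Return up to `top` terms that best separate `cluster` from `others`.

    Both arguments are lists of tokenized documents (lists of words).
    """
    inside = Counter(term for doc in cluster for term in set(doc))
    outside = Counter(term for doc in others for term in set(doc))

    def score(term):
        in_share = inside[term] / len(cluster)
        out_share = outside[term] / len(others) if others else 0.0
        return in_share - out_share

    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(inside, key=lambda term: (-score(term), term))[:top]

space = [["star", "planet", "the"], ["planet", "orbit", "the"]]
movies = [["film", "actor", "the"], ["star", "film", "the"]]
labels = label_cluster(space, movies, top=2)
```

A word such as "the" is prominent in the cluster but equally prominent elsewhere, so it scores zero and is not chosen as a label.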

Visual Browsing: Hyperbolic Tree

Visual Browsing: Hyperbolic Tree

Study of Kohonen Feature Maps
• H. Chen, A. Houston, R. Sewell, and B. Schatz, JASIS 49(7)
• Comparison: Kohonen Map and Yahoo
• Task:
  • “Window shop” for interesting home page
  • Repeat with other interface
• Results:
  • Starting with map could repeat in Yahoo (8/11)
  • Starting with Yahoo unable to repeat in map (2/14)
UWMS Data Mining Workshop
Credit: Marti Hearst

Study (cont.)
• Participants liked:
  • Correspondence of region size to # documents
  • Overview (but also wanted zoom)
  • Ease of jumping from one topic to another
  • Multiple routes to topics
  • Use of category and subcategory labels
Credit: Marti Hearst

Study (cont.)
• Participants wanted:
  • hierarchical organization
  • other ordering of concepts (alphabetical)
  • integration of browsing and search
  • correspondence of color to meaning
  • more meaningful labels at same level of abstraction
  • fit more labels in the given space
  • combined keyword and category search
  • multiple category assignment (sports+entertain)
Credit: Marti Hearst

Browsing
• Effectiveness depends on
  • Starting point
  • Ease of orientation (are similar docs “close” etc., intuitive organization)
  • How adaptive system is
• Compare to physical browsing (library, grocery store)

Searching vs. Browsing
• Information need dependent
  • Open-ended (find an interesting quote on the virtues of friendship) -> browsing
  • Specific (directions to Pacific Bell Park) -> searching
• User dependent
  • Some users prefer searching, others browsing (confirmed in many studies: some hate to type)
  • You don’t need to know vocabulary for browsing.
• System dependent (some web sites don’t support search)
• Searching and browsing are often interleaved.

Searchers vs. Browsers
• 1/3 of users do not search at all
• 1/3 rarely search (or URLs only)
• Only 1/3 understand the concept of search
• (ISP data from 2000)

Exercise
• Observe your own information seeking behavior
  • WWW
  • University library
  • Grocery store
• Are you a searcher or a browser?
• How do you reformulate your query?
  • Read bad hits, then minus terms
  • Read good hits, then plus terms
  • Try a completely different query
  • …

The User in Information Access
[Flow diagram: User's information need -> Find starting point -> Formulate/Reformulate Query -> Send to system -> Receive results -> Explore results -> Done? (no: loop back; yes: Stop)]

Query Specification
• Recall:
  • Relevance feedback
  • Query expansion
  • Spelling correction
  • Query-log mining based
• Interaction styles for query specification
• Queries on the Web
• Parametric search
• Term browsing

Query Specification: Interaction Styles
• Shneiderman 97
  • Command Language
  • Form Fill-in
  • Menu Selection
  • Direct Manipulation
  • Natural Language
• Example:
  • How does each apply to Boolean queries?
Credit: Marti Hearst

Command-Based Query Specification
• command attribute value connector …
  • find pa shneiderman and tw user#
• What are the attribute names?
• What are the command names?
• What are allowable values?
Credit: Marti Hearst
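
To make the "command attribute value connector …" pattern concrete, here is a small parser sketch for queries shaped like the slide's example. The connector set (and, or, not), the single-token values, and the function name are assumptions; the slide does not define the actual command language.

```python
CONNECTORS = {"and", "or", "not"}  # assumed connector vocabulary

def parse_command(text):
    """Parse 'command attr value [connector attr value]...'.

    Returns (command, clauses), where each clause is a tuple
    (connector_or_None, attribute, value).
    """
    tokens = text.split()
    if len(tokens) < 3:
        raise ValueError("expected: command attribute value ...")
    command, rest = tokens[0], tokens[1:]
    clauses, connector, i = [], None, 0
    while i < len(rest):
        if i + 1 >= len(rest):
            raise ValueError("attribute without a value")
        clauses.append((connector, rest[i], rest[i + 1]))
        i += 2
        if i < len(rest):
            connector = rest[i].lower()
            if connector not in CONNECTORS:
                raise ValueError("unknown connector: " + rest[i])
            i += 1
            if i >= len(rest):
                raise ValueError("dangling connector")
    return command, clauses

parsed = parse_command("find pa shneiderman and tw user#")
```

Even this tiny grammar shows the burden the slide's questions point to: the user has to remember the command names, the attribute codes, and what counts as a legal value.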

Form-Based Query Specification (Altavista)
Credit: Marti Hearst

Form-Based Query Specification (Melvyl)
Credit: Marti Hearst

Form-Based Query Specification (Infoseek)
Credit: Marti Hearst

Direct Manipulation Spec.: VQUERY (Jones 98)
Credit: Marti Hearst

Menu-Based Query Specification (Young & Shneiderman 93)
Credit: Marti Hearst

Query Specification/Reformulation
• A good user interface makes it easy for the user to reformulate the query
• Challenge: one user interface is not ideal for all types of information needs

Types of Information Needs
• Need answer to question (who won the game?)
• Re-find a particular document
• Find a good recipe for tonight’s dinner
• Authoritative summary of information (HIV review)
• Exploration of new area (browse sites about Baja)

Queries on the Web: Most Frequent on 2002/10/26

Queries on the Web (2000)

Intranet Queries (Aug 2000)
Count  Query
3351   bearfacts
3349   telebears
1909   extension
1874   schedule+of+classes
1780   bearlink
1737   bear+facts
1468   decal
1443   infobears
1227   calendar
 989   career+center
 974   campus+map
 920   academic+calendar
 840   map
 773   bookstore
 741   class+pass
 738   housing
 721   tele-bears
 716   directory
 667   schedule
 627   recipes
 602   transcripts
 582   tuition
 577   seti
 563   registrar
 550   info+bears
 543   class+schedule
 470   financial+aid
Source: Ray Larson

Intranet Queries
• Summary of sample data from 3 weeks of UCB queries
  • 13.2% Telebears/BearFacts/InfoBears/BearLink (12297)
  • 6.7% Schedule of classes or final exams (6222)
  • 5.4% Summer Session (5041)
  • 3.2% Extension (2932)
  • 3.1% Academic Calendar (2846)
  • 2.4% Directories (2202)
  • 1.7% Career Center (1588)
  • 1.7% Housing (1583)
  • 1.5% Map (1393)
• Average query length over last 4 months: 1.8 words
• This suggests what is difficult to find from the home page
Source: Ray Larson
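
Summaries like these can be derived from a raw query log with a few lines of code. The sketch below assumes the log is a mapping from query string to count, with "+" separating words as in the Aug 2000 listing; the variant-merging rule (dropping "+" and "-") is a crude illustrative assumption, not the method used to produce the slide's figures.

```python
from collections import Counter

def words(query):
    """Split a logged query into words ('+' separates words in the log)."""
    return [w for w in query.split("+") if w]

def average_length(log):
    """Count-weighted average number of words per query."""
    total = sum(log.values())
    if total == 0:
        raise ValueError("empty log")
    return sum(len(words(q)) * n for q, n in log.items()) / total

def merge_variants(log):
    """Group spelling variants such as 'bear+facts' and 'bearfacts'."""
    merged = Counter()
    for query, n in log.items():
        merged[query.replace("+", "").replace("-", "")] += n
    return merged

# Four entries taken from the Aug 2000 listing above.
log = {"bearfacts": 3351, "bear+facts": 1737, "telebears": 3349, "tele-bears": 721}
merged = merge_variants(log)
avg = average_length(log)
```

Merging variants matters for the interpretation on the slide: a service that users spell three different ways is even harder to find from the home page than any single count suggests.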

Query Specification: Feast or Famine
[Figure labels: Feast, Famine]
Specifying a well-targeted query is hard. Bigger problem for Boolean.

Parametric search
• Each document has, in addition to text, some “meta-data”, e.g.,
  • Language = French
  • Format = pdf
  • Subject = Physics etc.
  • Date = Feb 2000
• A parametric search interface allows the user to combine a full-text query with selections on these parameters, e.g.,
  • language, date range, etc.
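
A minimal sketch of combining a full-text query with parameter selections. The document layout (a dict with "text" and "meta"), the convention that a tuple means an inclusive range, and the optional sort column are all assumptions made for illustration.

```python
from datetime import date

def matches(value, wanted):
    """Exact match, or inclusive range match when `wanted` is a (lo, hi) tuple."""
    if isinstance(wanted, tuple):
        lo, hi = wanted
        return value is not None and lo <= value <= hi
    return value == wanted

def parametric_search(docs, text="", filters=None, sort_by=None):
    """Keep docs that satisfy every filter and contain every query word."""
    query_words = text.lower().split()
    results = []
    for doc in docs:
        meta = doc["meta"]
        if any(not matches(meta.get(k), v) for k, v in (filters or {}).items()):
            continue
        doc_words = set(doc["text"].lower().split())
        if all(w in doc_words for w in query_words):
            results.append(doc)
    if sort_by is not None:  # assumes every result has this metadata field
        results.sort(key=lambda d: d["meta"][sort_by])
    return results

docs = [
    {"id": 1, "text": "quantum physics lecture",
     "meta": {"language": "French", "format": "pdf", "date": date(2000, 2, 10)}},
    {"id": 2, "text": "quantum chemistry",
     "meta": {"language": "French", "format": "html", "date": date(1999, 5, 1)}},
    {"id": 3, "text": "quantum physics notes",
     "meta": {"language": "English", "format": "pdf", "date": date(2000, 1, 5)}},
]
french = parametric_search(docs, text="quantum", filters={"language": "French"})
```

The `sort_by` argument mirrors clicking a column heading in a tabular result list: the same hits, reordered by one parameter.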

Parametric search example
Notice that the output is a (large) table. Various parameters in the table (column headings) may be clicked on to effect a sort.

Parametric search example
We can add text search.

Interfaces for term browsing

The User in Information Access
[Flow diagram: User's information need -> Find starting point -> Formulate/Reformulate Query -> Send to system -> Receive results -> Explore results -> Done? (no: loop back; yes: Stop)]

Explore Results
• Determine: Do these results answer my question?
  • Summarization
  • More generally: provide context
• Hypertext navigation: Can I find the answer by following a link?
• Browsing and clustering (again)
  • Browse to explore results

Explore Results: Context
• We can’t present complete documents in the result set – too much information.
• Present information about each doc
  • Must be concise (so we can show many docs)
  • Must be informative
• Typical information about each document
  • Summary
  • Context of query words
  • Meta data: date, author, language, file name/URL
  • Context of document in collection

Context in Collection: Cha-Cha

Category Labels
• Advantages:
  • Interpretable
  • Capture summary information
  • Describe multiple facets of content
  • Domain dependent, and so descriptive
• Disadvantages:
  • Do not scale well (for organizing documents)
  • Domain dependent, so costly to acquire
  • May mismatch users’ interests
Credit: Marti Hearst

Evaluate Results
Context in Hierarchy: Cat-a-Cone

Explore Results: Summarization
• Query-dependent summarization
  • KWIC (keyword in context) lines (à la Google)
• Query-independent summarization
  • Summary written by author (if available)
  • Exploit genre (news stories)
  • Sentence extraction
  • Natural language generation
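
As an illustration of query-dependent summarization, the sketch below produces KWIC lines by showing a fixed window of words around each query-term occurrence. The window size, the punctuation stripping, and the "..." markers are illustrative assumptions; production snippet generators are considerably more elaborate.

```python
def kwic(text, query_terms, window=3):
    """Return one keyword-in-context line per occurrence of a query term."""
    tokens = text.split()
    terms = {term.lower() for term in query_terms}
    lines = []
    for i, token in enumerate(tokens):
        # Compare case-insensitively, ignoring surrounding punctuation.
        if token.lower().strip(".,;:!?\"'()") in terms:
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            line = " ".join(tokens[lo:hi])
            if lo > 0:
                line = "... " + line
            if hi < len(tokens):
                line = line + " ..."
            lines.append(line)
    return lines

sentence = ("Scatter/gather allows the user to find a set of documents "
            "of interest through browsing.")
snippets = kwic(sentence, ["documents"], window=2)
```

Because the lines depend on the query, the same document yields different summaries for different queries, which is what separates KWIC from the query-independent methods listed above.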

Evaluate Results
Structure of document: SeeSoft

Personalization: Outride Personalized Search System
[Architecture diagram labels: User Query, Query Augmentation, Result Processing, Result Set, Interests, Demographics, Click Stream, Search History, Application Usage, Intranet Search, Web Search, Outride Side Bar Interface]
Outride Schema: User x Content x History x Demographics
Search Engine Schema: Keyword x Doc ID x Link Rank

How Long to Get an Answer?
[Chart: Average Task Completion Time in Seconds]
SOURCE: ZDLabs/eTesting, Inc. October 2000

SOURCE: ZDLabs/eTesting, Inc. October 2000

Novices versus Experts
[Chart: Average Time to Complete Task; axes: Time (Seconds) vs. User Skill Level]
SOURCE: ZDLabs/eTesting, Inc. October 2000

Performance of Interactive Retrieval

Boolean Queries: Interface Issues
• Boolean logic is difficult for the average user.
• Much research was done on interfaces facilitating the creation of Boolean queries by non-experts.
• Much of this research was made obsolete by the web.
• Current view is that non-expert users are best served with non-Boolean or simple +/- Boolean (pioneered by AltaVista).
• But Boolean queries are the standard for certain groups of expert users (e.g., lawyers).
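
A sketch of what a simple +/- query syntax might look like in code: "+term" must occur, "-term" must not occur, and bare terms only influence ranking. These semantics, the whitespace tokenization, and the function names are assumptions for illustration; the slide does not spell out how any particular engine interpreted the operators.

```python
def parse_query(query):
    """Split a query into (required, excluded, optional) term lists."""
    required, excluded, optional = [], [], []
    for token in query.lower().split():
        if token.startswith("+") and len(token) > 1:
            required.append(token[1:])
        elif token.startswith("-") and len(token) > 1:
            excluded.append(token[1:])
        else:
            optional.append(token)
    return required, excluded, optional

def search(query, docs):
    """Return ids of matching docs, best first. docs maps id -> text."""
    required, excluded, optional = parse_query(query)
    hits = []
    for doc_id, text in docs.items():
        doc_words = set(text.lower().split())
        if any(term not in doc_words for term in required):
            continue
        if any(term in doc_words for term in excluded):
            continue
        score = sum(term in doc_words for term in optional)
        if required or score > 0:
            hits.append((score, doc_id))
    # Higher score first; ties broken by document id.
    return [doc_id for score, doc_id in sorted(hits, key=lambda h: (-h[0], h[1]))]

pages = {1: "jaguar car dealer", 2: "jaguar cat habitat", 3: "car dealer"}
hits = search("+jaguar -cat car", pages)
```

Compared with nested AND/OR/NOT expressions, this syntax has no precedence rules or parentheses to learn, which is part of why it suits non-expert users.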

User Interfaces: Other Issues
• Technical HCI issues
  • How to use screen real estate
  • One monolithic window or many?
  • Undo operator
  • Give access to history
• Alternative interfaces for novice/expert users
• Disabilities

Take-Away
• Don’t ignore the user in information retrieval.
• Finding matching documents for a query is only part of information access and “knowledge work”.
• In addition to core information retrieval, information access interfaces need to support
  • Finding starting points
  • Formulation/reformulation of queries
  • Exploring/evaluating results

Exercise
• Current information retrieval user interfaces are designed for typical computer screens.
• How would you design a user interface for a wall-size screen?

Resources
MIR Ch. 10.0 – 10.7
Donna Harman, Overview of the fourth text retrieval conference (TREC 4), National Institute of Standards and Technology.
Cutting, Karger, Pedersen, Tukey. Scatter/Gather. ACM SIGIR.
Hearst, Cat-a-cone, an interactive interface for specifying searches and viewing retrieval results in a large category hierarchy, ACM SIGIR.