SIMS 296a-3: UI Background
Marti Hearst, Fall ’98
Interface Topics Today
- (Other topics will be covered later)
- Supporting the Dynamic Continuing Process of Search
- Search Starting Points
Marti Hearst UCB SIMS, Fall 98
Human Information Seeking Behavior
Standard Model
- Assumptions:
  - Maximizing precision and recall simultaneously
  - The information need remains static
  - The value is in the resulting document set
[Diagram of the standard model. Labels: User’s Information Need, Collections, Pre-process, text input, Parse, Query, Index, Rank or Match, Query Reformulation]
“Berry-Picking” as an Information Seeking Strategy (Bates 90)
- Standard IR model
  - The information need remains the same throughout the search session.
  - Goal is to produce a perfect set of relevant docs.
- Berry-picking model
  - The query is continually shifting.
  - Users may move through a variety of sources.
  - New information may yield new ideas and new directions.
  - The value of search is in the bits and pieces picked up along the way.
A sketch of a searcher… “moving through many actions towards a general goal of satisfactory completion of research related to an information need.” (after Bates 90)
[Diagram: a path through queries labeled Q0 to Q5]
Implications
- Interfaces should make it easy to store intermediate results
- Interfaces should make it easy to follow trails with unanticipated results
- Difficulties with evaluation
Supporting the Information Seeking Process
- Two recent similar approaches that focus on supporting the process
  - SketchTrieve (Hendry & Harper 97)
  - DLITE (Cousins 97)
Informal Interface
- Informal does not mean less useful
- Show the search is
  - unfolding or evolving
  - expanding or contracting
- Prompt the user to
  - reformulate and abandon plans
  - backtrack to points of task deferral
  - make side-by-side comparisons
  - define and discuss problems
SketchTrieve: An Informal Interface (Hendry & Harper 96, 97)
- A “spreadsheet” for information access
- Make use of layout, space, and locality
  - comprehension and explanation
  - search planning
- A data-flow notation for information seeking
  - link sources to queries
  - link both to retrieved documents
  - align results in space for comparison
SketchTrieve: Connecting Results with Next Query
DLITE (Cousins 97)
- Drag and Drop interface
- Reify queries, sources, retrieval results
- Animation to keep track of activity
Starting Points for Search
- Faced with a prompt or an empty entry form … how to start?
  - Lists of sources
  - Overviews
    - Clusters
    - Category Hierarchies/Subject Codes
    - Co-citation Links
  - Examples
  - Automatic source selection
List of Sources
- Have to guess based on the name
- Requires prior exposure/experience
Overviews in the User Interface
- Unsupervised Groupings
  - Clustering
  - Kohonen Feature Maps
- Supervised Categories
  - Yahoo!
  - Superbook
  - HiBrowse
  - Cat-a-Cone
- Combinations
  - DynaCat
  - SONIA
Text Clustering
- Finds overall similarities among groups of documents
- Finds overall similarities among groups of tokens
- Picks out some themes, ignores others
Text Clustering is “The art of finding groups in data.” -- Kaufman and Rousseeuw
[Figure: axes labeled Term 1 and Term 2]
Document/Document Matrix
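A minimal sketch of how a document/document matrix might be built, assuming bag-of-words term counts and cosine similarity (the slide does not name a measure). The function names and the three sample documents are hypothetical.

```python
import math
from collections import Counter

def term_vector(text):
    """Bag-of-words term counts for one document (whitespace tokenization)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def similarity_matrix(docs):
    """Entry [i][j] holds the similarity between document i and document j."""
    vectors = [term_vector(d) for d in docs]
    n = len(vectors)
    return [[cosine(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]

# Hypothetical documents, chosen only for illustration.
docs = [
    "star galaxy telescope",
    "galaxy star constellation",
    "film star actor",
]
matrix = similarity_matrix(docs)
```

The matrix is symmetric, and a clustering method can read similarities from it instead of recomputing them for every pair.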
Agglomerative Clustering
[Figure: items labeled A B C D E F G H I]
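A minimal sketch of bottom-up (agglomerative) clustering. Single-link merging, the `closeness` function, and the 1-D positions given to items A through I are all assumptions made for illustration; the slides do not specify a linkage rule or the underlying data.

```python
def agglomerate(items, similarity, k):
    """Merge the two most similar clusters until only k clusters remain."""
    if k < 1:
        raise ValueError("k must be at least 1")
    clusters = [[item] for item in items]  # start with one cluster per item
    while len(clusters) > k:
        best = None  # (similarity, index_a, index_b) of the best pair so far
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single link (assumed): score a pair of clusters by their closest members.
                sim = max(similarity(x, y) for x in clusters[a] for y in clusters[b])
                if best is None or sim > best[0]:
                    best = (sim, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Hypothetical 1-D positions for the nine items.
points = {"A": 1.0, "B": 1.2, "C": 1.1, "D": 5.0, "E": 5.3,
          "F": 9.0, "G": 9.1, "H": 9.4, "I": 5.1}

def closeness(x, y):
    """Higher means more similar: negated distance between two items."""
    return -abs(points[x] - points[y])

groups = agglomerate(sorted(points), closeness, 3)
```

Stopping at `k` clusters is one design choice; running the loop down to a single cluster and recording each merge would give the full hierarchy instead.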
K-Means Clustering
1. Create a pair-wise similarity measure
2. Find K centers using agglomerative clustering
  - take a small sample
  - group bottom up until K groups found
3. Assign each document to nearest center, forming new clusters
4. Repeat 3 as necessary
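The four steps above can be sketched as follows, assuming 2-D points and Euclidean distance in place of a document similarity measure. The caller supplies the small sample; the names `seed_centers` and `k_means`, the centroid-distance merging rule, and the example points are hypothetical.

```python
import math

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def seed_centers(sample, k):
    """Step 2: group a small sample bottom-up until k groups remain."""
    if k < 1:
        raise ValueError("k must be at least 1")
    groups = [[p] for p in sample]
    while len(groups) > k:
        # Merge the pair of groups whose centroids are closest (assumed rule).
        pairs = [(math.dist(centroid(groups[a]), centroid(groups[b])), a, b)
                 for a in range(len(groups)) for b in range(a + 1, len(groups))]
        _, a, b = min(pairs)
        groups[a] += groups[b]
        del groups[b]
    return [centroid(g) for g in groups]

def k_means(points, sample, k, max_rounds=20):
    centers = seed_centers(sample, k)
    assignment = None
    for _ in range(max_rounds):
        # Step 3: assign every point to its nearest center.
        new_assignment = [min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
                          for p in points]
        if new_assignment == assignment:
            break  # Step 4: stop repeating once nothing moves
        assignment = new_assignment
        for c in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = centroid(members)
    return assignment, centers

# Hypothetical data: three well-separated groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (0, 10), (1, 10), (0, 11), (1, 11)]
sample = points[::2]  # the "small sample": every other point
assignment, centers = k_means(points, sample, 3)
```

Seeding from a clustered sample, rather than from randomly chosen points, tends to give the assignment step a sensible starting position, at the cost of a quadratic pass over the sample only.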
The Cluster Hypothesis
“Closely associated documents tend to be relevant to the same requests.” van Rijsbergen 1979
“… I would claim that document clustering can lead to more effective retrieval than linear search [which] ignores the relationships that exist between documents.” van Rijsbergen 1979
Clustering as Categorization
“In a traditional library environment … the items are classified first into subject areas, and a search is restricted to items within a few chosen subject classes. The same device can also be used … [to construct] groups of related documents and confining the search to certain groups only.” Salton 71
Clustering as Categorization
“… In experiments we often want to vary the cluster representatives at search time. … Of course, were we to design an operational classification, the cluster representatives would be constructed once and for all at cluster time.” van Rijsbergen 79
Scatter/Gather
Cutting, Pedersen, Tukey & Karger 92, 93; Hearst & Pedersen 95
- Cluster sets of documents into general “themes”, like a table of contents
- Display the contents of the clusters by showing topical terms and typical titles
- User chooses subsets of the clusters and re-clusters the documents within
- Resulting new groups have different “themes”
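A rough sketch of two pieces of the loop described above: summarizing a cluster for display, and pooling the clusters a user selects so they can be re-clustered. It assumes documents are `(title, text)` pairs, that "topical terms" are simply the most frequent words, and that the "typical title" is the document using those words most often. These are illustrative simplifications, not the published Scatter/Gather method, and the function names and sample clusters are hypothetical.

```python
from collections import Counter

def digest(cluster, n_terms=3):
    """Summarize a cluster by its most frequent terms and one typical title."""
    counts = Counter(word for _, text in cluster for word in text.lower().split())
    top_terms = [term for term, _ in counts.most_common(n_terms)]
    # "Typical" here means the document whose text uses the top terms most often.
    typical = max(cluster,
                  key=lambda doc: sum(word in top_terms for word in doc[1].lower().split()))
    return top_terms, typical[0]

def gather(clusters, chosen):
    """Pool the documents of the selected clusters, ready to be re-clustered."""
    return [doc for index in chosen for doc in clusters[index]]

# Hypothetical clusters of (title, text) pairs.
clusters = [
    [("Supernova remnants", "star explosion star gas"), ("Spiral galaxies", "star galaxy")],
    [("Film stars of the 1950s", "film actor film"), ("Television", "film tv")],
]
summary = [digest(c, n_terms=1) for c in clusters]
pooled = gather(clusters, [0])
```

Passing `pooled` back through whatever clustering routine produced `clusters` would complete one scatter/gather round.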
[Diagram. Labels: query, Collection, Rank, Cluster]
S/G Example: query on “star”
Encyclopedia text
Clusters (number of documents, theme):
- 8 symbols
- 68 film, tv (p)
- 97 astrophysics
- 67 astronomy (p)
- 10 flora/fauna
- 14 sports
- 47 film, tv
- 7 music
- 12 stellar phenomena
- 49 galaxies, stars
- 29 constellations
- 7 miscellaneous
Clustering and re-clustering is entirely automated
Two Queries: Two Clusterings
AUTO, CAR, ELECTRIC                 | AUTO, CAR, SAFETY
8 control drive accident …          | 6 control inventory integrate …
25 battery california technology …  | 10 investigation washington …
48 import j. rate honda toyota …    | 12 study fuel death bag air …
16 export international unit japan  | 61 sale domestic truck import …
3 service employee automatic …      | 11 japan export defect unite …
The main differences are the clusters that are central to the query
Publication History of Scatter/Gather
(Publication timing may lag significantly behind when the work was done)
- 1991: Patents Filed
- SIGIR 92: Initial Algorithm Introduced
- SIGIR 93: Optimizations Presented
- AAAIFS 95: Examples of Use on Retrieval Results
- TREC 95: Use in Interactive Track Experiments
- CHI 96: Experiments providing evidence that users learn collection structure
- SIGIR 96: Evidence that clustering can improve ranking for TREC-like scenario
Another use of clustering
- Use clustering to map the entire huge multidimensional document space into a huge number of small clusters.
- “Project” these onto a 2D graphical representation:
Clustering Multi-Dimensional Document Space (image from Wise et al 95)
Concept “Landscapes”
[Figure: map regions labeled Disease, Pharmacology, Anatomy, Legal, Hospitals]
Built using Kohonen Feature Maps (Xia Lin, H. C. Chen)
Visualization of Clusters
- Huge 2D maps may be an inappropriate focus for information retrieval
  - Can’t see what documents are about
  - Documents forced into one position in semantic space
  - Space is difficult to use for IR purposes
  - Hard to view titles
- Perhaps more suited for pattern discovery
  - problem: often only one view on the space
Using Clustering in Document Ranking
- Cluster entire collection
- Find cluster centroid that best matches the query
- This has been explored extensively
  - it is expensive
  - it doesn’t work well
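A minimal sketch of the cluster-based approach described above: represent each cluster by the centroid of its documents and pick the cluster whose centroid best matches the query. Bag-of-words vectors, cosine matching, the function names, and the two sample clusters are assumptions for illustration.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def centroid(texts):
    """Average term-count vector over the documents of one cluster."""
    total = Counter()
    for text in texts:
        total.update(text.lower().split())
    return {term: count / len(texts) for term, count in total.items()}

def best_cluster(query, clusters):
    """Index of the cluster whose centroid is most similar to the query."""
    query_vector = Counter(query.lower().split())
    scores = [cosine(query_vector, centroid(texts)) for texts in clusters]
    return max(range(len(clusters)), key=lambda i: scores[i])

# Hypothetical pre-computed clusters of document texts.
clusters = [
    ["star galaxy telescope", "galaxy constellation"],
    ["film star actor", "tv film"],
]
choice = best_cluster("film actor", clusters)
```

The query is compared with one centroid per cluster rather than with every document, which is where the hoped-for savings come from; the cost of clustering the whole collection up front is the expense the slide refers to.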
Using Clustering in Interfaces
- Alternative (scatter/gather):
  - cluster top-ranked documents
  - show cluster summaries to user
- Seems useful
  - experiments show relevant docs tend to end up in the same cluster
  - users seem able to interpret and use the cluster summaries some of the time
- More computationally feasible
Clustering
- Advantages:
  - Sometimes discovers meaningful themes
  - Data-driven, so reflects emphases present in the collection of documents
  - Can differentiate heterogeneous collections
  - Domain independent
- Disadvantages:
  - Variability in quality of results
  - Only one view on documents’ themes
  - Not good at differentiating homogeneous collections
  - Requires interpretation
  - May mis-match users’ interests
Incorporating Categories into the Interface
- Yahoo is the standard method
- Problems:
  - Hard to search, meant to be navigated.
  - Only one category per document (usually)
Integrated Browsing & Search
- Search for category labels
- Browse category labels
- Search within document collection
- Browse resulting documents in book
Example: MeSH and MedLine
- MeSH Category Hierarchy
  - ~18,000 labels
  - manually assigned
  - ~8 labels/article on average
  - avg depth: 4.5, max depth 9
- Top Level Categories: anatomy, animals, disease, drugs, diagnosis, psych, biology, physics, related disc, technology, humanities
Large Category Sets
- Problems for User Interfaces
  - Too many categories to browse
  - Too many docs per category
  - Docs belong to multiple categories
  - Need to integrate search
  - Need to show the documents
- We’ll discuss this more next week.
Category Labels
- Advantages:
  - Interpretable
  - Capture summary information
  - Describe multiple facets of content
  - Domain dependent, and so descriptive
- Disadvantages:
  - Do not scale well (for organizing documents)
  - Domain dependent, so costly to acquire
  - May mis-match users’ interests
Other Starting Points Approaches
- Co-citation Links
- Examples, Guided Tours
Next Week
- Interfaces for Subject Codes/Category Hierarchies
- Leader: Alison Brandt