SIMS 247 Information Visualization and Presentation Prof Marti

  • Slides: 49
Download presentation
SIMS 247 Information Visualization and Presentation Prof. Marti Hearst October 5, 2000

SIMS 247 Information Visualization and Presentation Prof. Marti Hearst October 5, 2000

Today and Next Time • Why Text is Tough • Visualizing Concept Spaces –

Today and Next Time • Why Text is Tough • Visualizing Concept Spaces – Clusters – Category Hierarchies • Visualizing Query Specifications – Selecting Term Subsets – Viewing Metadata • Visualizing Retrieval Results – Term Hit Distribution – Grouping of Retrieved Documents

Why Visualize Text? • To help with Information Retrieval – give an overview of

Why Visualize Text? • To help with Information Retrieval – give an overview of a collection – show user what aspects of their interests are present in a collection – help user understand why documents retrieved as a result of a query • Text Data Mining – not much has been done in this yet • Software Engineering – not really text, but has some similar properties

Why Text is Tough • Text is not pre-attentive • Text consists of abstract

Why Text is Tough • Text is not pre-attentive • Text consists of abstract concepts – which are difficult to visualize • Text represents similar concepts in many different ways – space ship, flying saucer, UFO, figment of imagination • Text has very high dimensionality – Tens or hundreds of thousands of features – Many subsets can be combined together

Why Text is Tough The Dog.

Why Text is Tough The Dog.

Why Text is Tough The Dog. The dog cavorts. The dog cavorted.

Why Text is Tough The Dog. The dog cavorts. The dog cavorted.

Why Text is Tough The man walks.

Why Text is Tough The man walks.

Why Text is Tough The man walks the cavorting dog. So far, we can

Why Text is Tough The man walks the cavorting dog. So far, we can sort of show this in pictures.

Why Text is Tough As the man walks the cavorting dog, thoughts arrive unbidden

Why Text is Tough As the man walks the cavorting dog, thoughts arrive unbidden of the previous spring, so unlike this one, in which walking was marching and dogs were baleful sentinals outside unjust halls. How do we visualize this?

Why Text is Tough • Abstract concepts are difficult to visualize • Combinations of

Why Text is Tough • Abstract concepts are difficult to visualize • Combinations of abstract concepts are even more difficult to visualize – time – shades of meaning – social and psychological concepts – causal relationships

Why Text is Tough • Language only hints at meaning • Most meaning of

Why Text is Tough • Language only hints at meaning • Most meaning of text lies within our minds and common understanding – “How much is that doggy in the window? ” • how much: social system of barter and trade (not the size of the dog) • “doggy” implies childlike, plaintive, probably cannot do the purchasing on their own • “in the window” implies behind a store window, not really inside a window, requires notion of window shopping

Why Text is Tough • General categories have no standard ordering (nominal data) •

Why Text is Tough • General categories have no standard ordering (nominal data) • Categorization of documents by single topics misses important distinctions • Consider an article about – NAFTA – The effects of NAFTA on truck manufacture – The effects of NAFTA on productivity of truck manufacture in the neighboring cities of El Paso and Juarez

Why Text is Tough • Other issues about language – ambiguous (many different meanings

Why Text is Tough • Other issues about language – ambiguous (many different meanings for the same words and phrases) – different combinations imply different meanings

Why Text is Tough • I saw Pathfinder on Mars with a telescope. •

Why Text is Tough • I saw Pathfinder on Mars with a telescope. • Pathfinder photographed Mars. • The Pathfinder photograph mars our perception of a lifeless planet. • The Pathfinder photograph from Ford has arrived. • The Pathfinder forded the river without marring its paint job.

Why Text is Easy • Text is highly redundant – When you have lots

Why Text is Easy • Text is highly redundant – When you have lots of it – Pretty much any simple technique can pull out phrases that seem to characterize a document • Instant summary: – Extract the most frequent words from a text – Remove the most common English words

Guess the Text • • • • 478 said 233 god 201 father 187

Guess the Text • • • • 478 said 233 god 201 father 187 land 181 jacob 160 son 157 joseph 134 abraham 121 earth 119 man 118 behold 113 years 104 wife 101 name 94 pharaoh

Text Collection Overviews • How can we show an overview of the contents of

Text Collection Overviews • How can we show an overview of the contents of a text collection? – Show info external to the docs • e. g. , date, author, source, number of inlinks • does not show what they are about – Show the meanings or topics in the docs • a list of titles • results of clustering words or documents • organize according to categories (next time)

Clustering for Collection Overviews – Scatter/Gather • show main themes as groups of text

Clustering for Collection Overviews – Scatter/Gather • show main themes as groups of text summaries – Scatter Plots • show docs as points; closeness indicates nearness in cluster space • show main themes of docs as visual clumps or mountains – Kohonen Feature maps • show main themes as adjacent polygons – BEAD • show main themes as links within a forcedirected placement network

Clustering for Collection Overviews • Two main steps – cluster the documents according to

Clustering for Collection Overviews • Two main steps – cluster the documents according to the words they have in common – map the cluster representation onto a (interactive) 2 D or 3 D representation

Text Clustering • Finds overall similarities among groups of documents • Finds overall similarities

Text Clustering • Finds overall similarities among groups of documents • Finds overall similarities among groups of tokens • Picks out some themes, ignores others

Scatter/Gather

Scatter/Gather

S/G Example: query on “star” Encyclopedia text 8 symbols 68 film, tv (p) 97

S/G Example: query on “star” Encyclopedia text 8 symbols 68 film, tv (p) 97 astrophysics 67 astronomy(p) 10 flora/fauna 14 sports 47 film, tv 7 music 12 steller phenomena 49 galaxies, stars 29 constellations 7 miscelleneous Clustering and re-clustering is entirely automated

Scatter/Gather Cutting, Pedersen, Tukey & Karger 92, 93, Hearst & Pedersen 95 • How

Scatter/Gather Cutting, Pedersen, Tukey & Karger 92, 93, Hearst & Pedersen 95 • How it works – Cluster sets of documents into general “themes”, like a table of contents – Display the contents of the clusters by showing topical terms and typical titles – User chooses subsets of the clusters and re-clusters the documents within – Resulting new groups have different “themes” • Originally used to give collection overview • Evidence suggests more appropriate for displaying retrieval results in context • Appearing (sort-of) in commercial systems

Northern Light: used to cluster exclusively. Now combines categorization with clustering

Northern Light: used to cluster exclusively. Now combines categorization with clustering

Northern Light second level clusters: are these really about NLP? Note that next level

Northern Light second level clusters: are these really about NLP? Note that next level corresponds to URLs

(Chen et al. 97) Scatter Plot of Clusters

(Chen et al. 97) Scatter Plot of Clusters

BEAD (Chalmers 97)

BEAD (Chalmers 97)

BEAD (Chalmers 96) An example layout produced by Bead, seen in overview, of 831

BEAD (Chalmers 96) An example layout produced by Bead, seen in overview, of 831 bibliography entries. The dimensionality (the number of unique words in the set) is 6925. A search for ‘cscw or collaborative’ shows the pattern of occurrences coloured dark blue, mostly to the right. The central rectangle is the visualizer’s motion control.

Themescapes (Wise et al. 95) Example: Themescapes (Wise et al. 95)

Themescapes (Wise et al. 95) Example: Themescapes (Wise et al. 95)

Clustering for Collection Overviews • Since text has tens of thousands of features –

Clustering for Collection Overviews • Since text has tens of thousands of features – the mapping to 2 D loses a tremendous amount of information – only very coarse themes are detected

Galaxy of News Rennison 95

Galaxy of News Rennison 95

Galaxy of News Rennison 95

Galaxy of News Rennison 95

(594 docs) (Lin 92, Chen et al. 97) Kohonen Feature Maps

(594 docs) (Lin 92, Chen et al. 97) Kohonen Feature Maps

Study of Kohonen Feature Maps • H. Chen, A. Houston, R. Sewell, and B.

Study of Kohonen Feature Maps • H. Chen, A. Houston, R. Sewell, and B. Schatz, JASIS 49(7) • Comparison: Kohonen Map and Yahoo • Task: – “Window shop” for interesting home page – Repeat with other interface • Results: – Starting with map could repeat in Yahoo (8/11) – Starting with Yahoo unable to repeat in map (2/14)

How Useful is Collection Cluster Visualization for Search? Three studies find negative results

How Useful is Collection Cluster Visualization for Search? Three studies find negative results

Study 1 • Kleiboemer, Lazear, and Pedersen. Tailoring a retrieval system for naive users.

Study 1 • Kleiboemer, Lazear, and Pedersen. Tailoring a retrieval system for naive users. In Proc. of the 5 th Annual Symposium on Document Analysis and Information Retrieval, 1996 • This study compared – a system with 2 D graphical clusters – a system with 3 D graphical clusters – a system that shows textual clusters • Novice users • Only textual clusters were helpful (and they were difficult to use well)

Study 2: Kohonen Feature Maps • H. Chen, A. Houston, R. Sewell, and B.

Study 2: Kohonen Feature Maps • H. Chen, A. Houston, R. Sewell, and B. Schatz, JASIS 49(7) • Comparison: Kohonen Map and Yahoo • Task: – “Window shop” for interesting home page – Repeat with other interface • Results: – Starting with map could repeat in Yahoo (8/11) – Starting with Yahoo unable to repeat in map (2/14)

Study 2 (cont. ) • Participants liked: – Correspondence of region size to #

Study 2 (cont. ) • Participants liked: – Correspondence of region size to # documents – Overview (but also wanted zoom) – Ease of jumping from one topic to another – Multiple routes to topics – Use of category and subcategory labels

Study 2 (cont. ) • Participants wanted: – – – – – hierarchical organization

Study 2 (cont. ) • Participants wanted: – – – – – hierarchical organization other ordering of concepts (alphabetical) integration of browsing and search correspondence of color to meaning more meaningful labels at same level of abstraction fit more labels in the given space combined keyword and category search multiple category assignment (sports+entertain)

Study 3: NIRVE • NIRVE Interface by Cugini et al. 96. Each rectangle is

Study 3: NIRVE • NIRVE Interface by Cugini et al. 96. Each rectangle is a cluster. Larger clusters closer to the “pole”. Similar clusters near one another. Opening a cluster causes a projection that shows the titles.

Study 3 • Visualization of search results: a comparative evaluation of text, 2 D,

Study 3 • Visualization of search results: a comparative evaluation of text, 2 D, and 3 D interfaces Sebrechts, Cugini, Laskowski, Vasilakis and Miller, Proceedings of SIGIR 99, Berkeley, CA, 1999. • This study compared: – 3 D graphical clusters – 2 D graphical clusters – textual clusters • 15 participants, between-subject design • Tasks – – – Locate a particular document Locate and mark a particular document Locate a previously marked document Locate all clusters that discuss some topic List more frequently represented topics

Study 3 • Results (time to locate targets) – – Text clusters fastest 2

Study 3 • Results (time to locate targets) – – Text clusters fastest 2 D next 3 D last With practice (6 sessions) 2 D neared text results; 3 D still slower – Computer experts were just as fast with 3 D • Certain tasks equally fast with 2 D & text – Find particular cluster – Find an already-marked document • But anything involving text (e. g. , find title) much faster with text. – Spatial location rotated, so users lost context • Helpful viz features – Color coding (helped text too) – Relative vertical locations

Visualizing Clusters • Huge 2 D maps may be inappropriate focus for information retrieval

Visualizing Clusters • Huge 2 D maps may be inappropriate focus for information retrieval – cannot see what the documents are about – space is difficult to browse for IR purposes – (tough to visualize abstract concepts) • Perhaps more suited for pattern discovery and gist-like overviews

Co-Citation Analysis • Has been around since the 50’s. (Small, Garfield, White & Mc.

Co-Citation Analysis • Has been around since the 50’s. (Small, Garfield, White & Mc. Cain) • Used to identify core sets of – authors, journals, articles for particular fields – Not for general search • Main Idea: – Find pairs of papers that cite third papers – Look for commonalitieis • A nice demonstration by Eugene Garfield at: – http: //165. 123. 33/eugene_garfield/papers/mapsciworld. html

Co-citation analysis (From Garfield 98)

Co-citation analysis (From Garfield 98)

Co-citation analysis (From Garfield 98)

Co-citation analysis (From Garfield 98)

Co-citation analysis (From Garfield 98)

Co-citation analysis (From Garfield 98)

Visualizing Clusters • Huge 2 D maps may be inappropriate focus for information retrieval

Visualizing Clusters • Huge 2 D maps may be inappropriate focus for information retrieval – cannot see what the documents are about – documents are forced into one position in semantic space – space is difficult to browse for IR purposes • Perhaps more suited for pattern discovery – problem: often only one view on the space

Next Time • Visualizing Category Overviews • Visualizing Query Term Specification – available words

Next Time • Visualizing Category Overviews • Visualizing Query Term Specification – available words – available metadata • Visualizing Retrieval Results