Visualizing Document Collections
CS 5764: Information Visualization
Chris North
Where are we?
• Multi-D, 1D, 2D, 3D, Trees, Graphs, Document collections
• Design Principles
• Empirical Evaluation
• Visual Overviews
Structured Document Collections
• Multi-dimensional: author, title, date, journal, …
• Trees: Dewey decimal
• Graphs: web, citations
Envision
• Ed Fox, et al.
• Multi-D
• Similar to Spotfire
Citation Networks
• Butterfly Browser
• Mackinlay et al. (PARC)
• Butterfly: left = refs, right = citers, yellow = # citers, blue = visited
• 3-D plot: date, name, # citers
Unstructured Document Collections
• Focus on full text
• Examples:
  • digital libraries, news archives, web pages
  • email archives, image gallery
• Tasks:
  • Search
  • Browse
  • Classification, structurization
  • Statistics, keyword usage, languages
  • Subjects, themes, coverage
Visualization Strategies
• Cluster Maps
• Keyword Query
• Relationships
• Reduced representation
• User controlled layout
Cluster Map
• Create a “map” of the document collection
• Similar documents near each other
• Dissimilar documents far apart
• “Grocery store” concept
Document Vectors
         “aardvark”   “banana”   “chris”   …
Doc 1        1            2          0
Doc 2        2            1          0
Doc 3        0            0          3
• Similarity between a pair of docs = dot product
• Lay out documents in a 2-D map by similarity
• Similar to the spring model for graph layout
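
The dot-product similarity on this slide can be worked through on the three example documents. This is a minimal sketch: the term counts come from the slide's table, while the function names `dot` and `pairwise_similarity` are illustrative choices, not part of the original deck.

```python
# Term counts from the slide's table; the columns are the terms
# "aardvark", "banana", "chris".
doc_vectors = {
    "Doc 1": [1, 2, 0],
    "Doc 2": [2, 1, 0],
    "Doc 3": [0, 0, 3],
}


def dot(u, v):
    """Dot product of two equal-length term-count vectors."""
    return sum(a * b for a, b in zip(u, v))


def pairwise_similarity(vectors):
    """Return {(name_a, name_b): dot product} for every unordered pair."""
    names = list(vectors)
    return {
        (a, b): dot(vectors[a], vectors[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


similarity = pairwise_similarity(doc_vectors)
for pair, score in similarity.items():
    print(pair, score)
```

Doc 1 and Doc 2 share two terms and score 4, while Doc 3 shares no terms with either and scores 0, so a similarity-based layout would place Doc 1 and Doc 2 close together and Doc 3 far from both.
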
Cluster Algorithms
• Partition clustering: partition into k subsets
  • Pick k seeds
  • Iteratively attract nearest neighbors
• Hierarchical clustering: dendrogram
  • Group nearest-neighbor pair
  • Iterate
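
Both families of algorithm can be sketched in a few lines. The code below is one possible reading of the slide, not the specific algorithms used by any system in the deck: the partition version is a k-means-style loop that assumes the first k points are the seeds, and the hierarchical version assumes single linkage (group distance = distance between closest members). Points are 2-D tuples for readability.

```python
import math


def partition_cluster(points, k, iterations=10):
    """Partition points into k subsets: pick k seeds, then repeatedly
    assign each point to its nearest seed and move each seed to the
    mean of its subset (a k-means-style loop)."""
    seeds = [list(p) for p in points[:k]]  # assumption: first k points are seeds
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, seeds[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old seed if a subset is empty
                seeds[i] = [sum(c) / len(members) for c in zip(*members)]
    return clusters


def hierarchical_cluster(points):
    """Build a dendrogram as nested tuples by repeatedly merging the
    closest pair of groups (single linkage)."""
    groups = [((p,), p) for p in points]  # (member points, subtree)
    while len(groups) > 1:
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                d = min(math.dist(a, b)
                        for a in groups[i][0] for b in groups[j][0])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = (groups[i][0] + groups[j][0], (groups[i][1], groups[j][1]))
        groups = [g for n, g in enumerate(groups) if n not in (i, j)] + [merged]
    return groups[0][1]


points = [(0, 0), (10, 10), (0, 1), (10, 11)]
clusters = partition_cluster(points, k=2)
dendrogram = hierarchical_cluster(points)
print(clusters)
print(dendrogram)
```

The partition result depends on the seeds, which is one concrete form of the "algorithm becomes critical" concern: different seeds or linkage rules can produce different maps of the same collection.
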
Landscapes
• Wise et al., “Visualizing the non-visual”
• ThemeScapes, Cartia
• PNNL
• Mountain height = cluster size
Kohonen Maps
• Xia Lin, “Document Space”
• http://faculty.cis.drexel.edu/sitemap/index.html
WebSOM
• http://websom.hut.fi/websom/
Map.net
• http://maps.map.net/start
Galaxy of News
• MIT
• Cluster map with full-text zooming
Cluster Map
• Good:
  • Map of collection
  • Major themes and sizes
  • Relationships between themes
  • Scales up
• Bad:
  • Where to locate documents with multiple themes?
    » Both mountains, between mountains, …?
  • Relationships between documents, within documents?
  • Algorithm becomes (too) critical
Keyword Query
• Keyword query, search engine
• Rank-ordered list
• “Information Retrieval”
• Visualization of results
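
A rank-ordered result list can be illustrated with a deliberately simple scorer. This sketch assumes raw term counts as the relevance score, which is a simplification chosen for illustration; real search engines use more elaborate weighting. The function name `rank_documents` and the sample documents are hypothetical.

```python
def rank_documents(query, documents):
    """Score each document by how often the query terms occur in it and
    return (name, score) pairs, best first. Documents that match no
    query term are left out of the list."""
    terms = query.lower().split()
    scores = {}
    for name, text in documents.items():
        words = text.lower().split()
        score = sum(words.count(term) for term in terms)
        if score > 0:
            scores[name] = score
    # Highest score first; ties broken alphabetically by name.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


documents = {
    "a": "cluster maps group similar documents",
    "b": "keyword query returns ranked documents documents",
    "c": "image browsing",
}
ranked = rank_documents("documents query", documents)
no_hits = rank_documents("aardvark", documents)
print(ranked)
print(no_hits)
```

Document "c" never appears in the ranked list because it uses neither keyword, and the second query returns nothing at all.
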
Keyword Distributions
• Hearst, “TileBars”
• http://elib.cs.berkeley.edu/tilebars/
• Keyword distributions within documents
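
The idea of showing where keywords fall inside a document can be sketched as a per-segment count. This is not Hearst's implementation: it assumes fixed-size word segments (the parameter `segment_size` is a stand-in for whatever segmentation the real system uses), and the text rendering with the characters in `shades` is purely illustrative.

```python
def keyword_distribution(text, keywords, segment_size=5):
    """Return one row per keyword; each row holds the keyword's count in
    successive fixed-size word segments of the document."""
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    segments = [words[i:i + segment_size]
                for i in range(0, len(words), segment_size)]
    return {k: [seg.count(k.lower()) for seg in segments] for k in keywords}


def render_tilebar(distribution):
    """Draw each keyword row as a strip of tiles; darker means more hits."""
    shades = " .:#"
    lines = []
    for keyword, counts in distribution.items():
        tiles = "".join(shades[min(c, len(shades) - 1)] for c in counts)
        lines.append(f"{keyword:>10} |{tiles}|")
    return "\n".join(lines)


text = ("maps show clusters of documents and maps help browsing while "
        "queries rank documents by keywords")
distribution = keyword_distribution(text, ["maps", "documents"])
print(render_tilebar(distribution))
```

Reading the rows side by side shows whether two keywords occur in the same part of the document or only in separate parts.
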
Document Distributions
• Korfhage, “VIBE”
• http://www.pitt.edu/~korfhage/interfaces.html
• Documents located between query keywords using spring model
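
Placing a document between query keywords can be sketched as a weighted average of the keyword positions. This assumes each keyword pulls on the document with a zero-rest-length spring whose stiffness is the document's relevance to that keyword; the equilibrium of such springs is the weighted centroid. The function name `vibe_position` and the anchor coordinates are hypothetical, and the original system may compute positions differently.

```python
def vibe_position(anchors, weights):
    """Place a document at the weighted average of the keyword anchors.

    anchors: {keyword: (x, y)} screen positions of the query keywords.
    weights: {keyword: relevance of the document to that keyword}.
    Returns None when the document matches no keyword.
    """
    total = sum(weights.get(k, 0) for k in anchors)
    if total == 0:
        return None
    x = sum(weights.get(k, 0) * pos[0] for k, pos in anchors.items()) / total
    y = sum(weights.get(k, 0) * pos[1] for k, pos in anchors.items()) / total
    return (x, y)


anchors = {"map": (0.0, 0.0), "query": (10.0, 0.0), "cluster": (5.0, 10.0)}
balanced = vibe_position(anchors, {"map": 1, "query": 1})  # midway on the bottom edge
skewed = vibe_position(anchors, {"map": 3, "query": 1})    # pulled toward "map"
unmatched = vibe_position(anchors, {})                      # no keyword matches
print(balanced, skewed, unmatched)
```

A document's position then encodes which keywords it matches and in what proportion, although distinct weight combinations can land on the same point.
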
VR-VIBE
Keyword Query
• Good:
  • Reduces the browsing space
  • Map according to user’s interests
• Bad:
  • What keywords do I use?
  • What about other related documents that don’t use these keywords?
  • No initial overview
  • Mega-hit, zero-hit problem
Relationships
• Show inter-relationships
• Matrix or complete graph
• Similarity measure between all pairs of docs
• Threshold level
• Salton
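
The matrix-plus-threshold idea can be sketched as follows. The similarity measure here is an assumption: cosine similarity is used so that scores fall in a fixed range and a single threshold is meaningful, whereas the deck itself only says "similarity measure". The vectors reuse the three example documents from the Document Vectors slide, and the function names are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity of two term-count vectors (0.0 if either is empty)."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0


def similarity_matrix(vectors):
    """Full matrix of similarities between all pairs of documents."""
    return [[cosine(u, v) for v in vectors] for u in vectors]


def threshold_edges(matrix, names, threshold):
    """Keep only the pairs whose similarity reaches the threshold; these
    are the links that would be drawn between documents."""
    return [
        (names[i], names[j])
        for i in range(len(names))
        for j in range(i + 1, len(names))
        if matrix[i][j] >= threshold
    ]


names = ["Doc 1", "Doc 2", "Doc 3"]
vectors = [[1, 2, 0], [2, 1, 0], [0, 0, 3]]
matrix = similarity_matrix(vectors)
edges = threshold_edges(matrix, names, threshold=0.5)
print(edges)
```

Raising the threshold thins the graph and lowering it approaches the complete graph, which is the trade-off behind the "threshold level" bullet.
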
Variations
• Docs + paragraphs
• Themes
Relationships
• Better for smaller, more detailed map
• Scale up: network visualization
• Good:
  • Can see more complex relationships between/within documents
  • Can act like hyperlinks!
• Bad:
  • Finding specific documents
  • Scale up difficult
Reduced Visual Representation
• Bederson, “Image browsing”
User Controlled Layout
• Card, “WebBook and Web Forager”
• http://vtopus.cs.vt.edu/~north/infoviz/webbook.mpa
Data Mountain
• Robertson, “Data Mountain”
• (Microsoft)