Mining Text and Web Data
• Text mining, natural language processing and information extraction: An Introduction
• Text categorization methods
• Mining Web linkage structures
  - Based on the slides by Deng Cai
• Summary
2021/2/28 Data Mining: Principles and Algorithms
Outline
• Background on Web Search
• VIPS (VIsion-based Page Segmentation)
• Block-based Web Search
• Block-based Link Analysis
• Web Image Search & Clustering
Search Engine – Two Rank Functions
• Relevance Ranking: similarity based on content or text
• Importance Ranking (Link Analysis): ranking based on link structure analysis
[Figure: search engine architecture. Components shown: Web Pages, Web Page Parser, Forward Index, Indexer, Inverted Index, Term Dictionary (Lexicon), Meta Data, Anchor Text Generator, Backward Link (Anchor Text), Forward Link, Web Graph Constructor, URL Dictionary, Web Topology Graph, Search, Rank Functions.]
Relevance Ranking
• Inverted index
  - A data structure for supporting text queries
  - Like the index in a book
[Figure: an inverted index. Dictionary terms (aalborg, ..., armada, armadillo, armani, ..., zz) each point to a posting list of document IDs (for example 3452, 11437, ...), which in turn refer to the documents on the indexing disks.]
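The inverted index described above can be illustrated with a minimal sketch. The function names (build_inverted_index, search), the whitespace tokenizer, and the three toy documents are assumptions made for illustration only; a production index would also store positions and term frequencies.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive tokenizer (assumption): lowercase and split on whitespace.
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search(index, query):
    """Return IDs of documents that contain every query term (AND semantics)."""
    postings = [set(index.get(term, ())) for term in query.lower().split()]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


# Toy corpus (hypothetical document IDs and texts).
docs = {
    4: "armada sails at dawn",
    19: "the armadillo and the armada",
    29: "armani suit",
}
index = build_inverted_index(docs)
```

Looking up a term is then a dictionary access, and a multi-term query is an intersection of posting lists, which is why the structure suits text queries.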
The PageRank Algorithm
• Basic idea
  - The significance of a page is determined by the significance of the pages linking to it
• More precisely:
  - Link graph: adjacency matrix A
  - Construct a probability transition matrix M by renormalizing each row of A to sum to 1
  - Treat the web graph as a Markov chain (random surfer)
  - The vector of PageRank scores p is then defined to be the stationary distribution of this Markov chain
  - Equivalently, p is the principal left eigenvector of the row-stochastic transition matrix M, that is, the principal right eigenvector of its transpose M^T
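The construction above can be sketched with power iteration. This is a minimal illustration, not the slides' exact formulation: the damping factor (0.85), the uniform jump for pages without out-links, and the three-page example graph are assumptions added so that the chain has a unique stationary distribution.

```python
def pagerank(adjacency, damping=0.85, tol=1e-10, max_iter=1000):
    """Approximate the stationary distribution of the random-surfer chain."""
    n = len(adjacency)
    # Row-normalize A into M; a page with no out-links jumps uniformly (assumption).
    M = []
    for row in adjacency:
        total = sum(row)
        M.append([x / total for x in row] if total else [1.0 / n] * n)
    p = [1.0 / n] * n
    for _ in range(max_iter):
        # One step of p <- damping * M^T p + (1 - damping) / n.
        new_p = [
            (1.0 - damping) / n + damping * sum(p[i] * M[i][j] for i in range(n))
            for j in range(n)
        ]
        if sum(abs(a - b) for a, b in zip(new_p, p)) < tol:
            return new_p
        p = new_p
    return p


# Hypothetical graph: page 0 links to 1 and 2, page 1 links to 2, page 2 links to 0.
A = [
    [0, 1, 1],
    [0, 0, 1],
    [1, 0, 0],
]
scores = pagerank(A)
```

In this toy graph page 2 receives links from both other pages and ends up with the highest score, while page 1, linked only from page 0, ends up with the lowest.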
Layout Structure
• Compared to plain text, a web page is a 2-D presentation
  - Rich visual effects created by different term types, formats, separators, blank areas, colors, pictures, etc.
  - Different parts of a page are not equally important
[Figure: a CNN.com page annotated with its elements]
Title: CNN.com International
H1: IAEA: Iran had secret nuke agenda
H3: EXPLOSIONS ROCK BAGHDAD …
TEXT BODY (with position and font type): The International Atomic Energy Agency has concluded that Iran has secretly produced small amounts of nuclear materials including low enriched uranium and plutonium that could be used to develop nuclear weapons according to a confidential report obtained by CNN…
Hyperlink:
• URL: http://www.cnn.com/...
• Anchor Text: Al Qaeda…
Image:
• URL: http://www.cnn.com/image/...
• Alt & Caption: Iran nuclear …
Anchor Text: CNN Homepage News …
Web Page Block – Better Information Unit
[Figure: a web page divided into blocks, each labeled with an importance level (Low, Med, or High).]
Motivation for VIPS (VIsion-based Page Segmentation)
• Problems of treating a web page as an atomic unit
  - A web page usually contains not only pure content
    - Noise: navigation, decoration, interaction, …
  - Multiple topics
  - Different parts of a page are not equally important
• A web page has internal structure
  - Two-dimensional logical structure & visual layout presentation
  - More structured than a free-text document
  - Less structured than a structured document
• Layout – the 3rd dimension of a web page
  - 1st dimension: content
  - 2nd dimension: hyperlink
Is DOM a Good Representation of Page Structure?
• Page segmentation using DOM
  - Extract structural tags such as P, TABLE, UL, TITLE, H1~H6, etc.
  - DOM is more related to content display and does not necessarily reflect semantic structure
• How about XML?
  - A long way to go to replace HTML
VIPS Algorithm
• Motivation:
  - In many cases, topics can be distinguished with visual clues such as position, distance, font, color, etc.
• Goal:
  - Extract the semantic structure of a web page based on its visual presentation
• Procedure:
  - Partition the web page top-down based on the separators
• Result:
  - A tree structure; each node in the tree corresponds to a block in the page
  - Each node is assigned a value (Degree of Coherence) to indicate how coherent the content in the block is, based on visual perception
  - Each block is assigned an importance value
  - Hierarchy or flat
VIPS: An Example
• A hierarchical structure of layout blocks
• A Degree of Coherence (DoC) is defined for each block
  - Shows the intra-block coherence
  - The DoC of a child block must be no less than its parent's
• The Permitted Degree of Coherence (PDoC) can be pre-defined to achieve different granularities for the content structure
  - The segmentation stops only when the DoC of every block is no less than the PDoC
  - The smaller the PDoC, the coarser the content structure
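The interplay between DoC and PDoC can be sketched as a recursive stopping rule. This is only an illustration under stated assumptions: the Block class, the integer DoC scale, and the precomputed children are hypothetical, whereas the real VIPS algorithm detects separators and child blocks from visual cues.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Block:
    name: str
    doc: int  # Degree of Coherence (integer scale is an assumption)
    children: List["Block"] = field(default_factory=list)


def segment(block, pdoc):
    """Return the blocks kept when partitioning stops at DoC >= PDoC."""
    # A block that cannot be split further is kept as-is (assumption).
    if block.doc >= pdoc or not block.children:
        return [block]
    leaves = []
    for child in block.children:
        leaves.extend(segment(child, pdoc))
    return leaves


# Hypothetical page: child DoC values are never lower than their parent's.
page = Block("page", 1, [
    Block("header", 8),
    Block("body", 4, [Block("article", 9), Block("sidebar", 7)]),
])
coarse = [b.name for b in segment(page, pdoc=4)]
fine = [b.name for b in segment(page, pdoc=7)]
```

With the smaller PDoC the body stays whole, while the larger PDoC forces it to split into article and sidebar, matching the rule that a smaller PDoC gives a coarser structure.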
Example of Web Page Segmentation (1)
[Figure: the same page shown as a DOM structure and as a VIPS structure.]
Example of Web Page Segmentation (2)
[Figure: the same page shown as a DOM structure and as a VIPS structure.]
• Can be applied to web image retrieval
  - Surrounding text extraction
Web Page Block – Better Information Unit
• Page Segmentation
  - Vision based approach
• Block Importance Modeling
  - Statistical learning
[Figure: a web page divided into blocks, each labeled with an importance level (Low, Med, or High).]
Block-based Web Search
• Index blocks instead of whole pages
• Block retrieval
  - Combining DocRank and BlockRank
• Block query expansion
  - Select expansion terms from relevant blocks
Experiments
• Dataset
  - TREC 2001 Web Track
    - WT10g corpus (1.69 million pages), crawled in 1997
    - 50 queries (topics 501-550)
  - TREC 2002 Web Track
    - .GOV corpus (1.25 million pages), crawled in 2002
    - 49 queries (topics 551-600)
• Retrieval System
  - Okapi, with weighting function BM2500
• Preprocessing
  - Stop-word list (about 220)
  - Do not use stemming
  - Do not consider phrase information
• Tune b, k1 and k3 to achieve the best baseline
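To show where the tuned parameters b, k1 and k3 enter, the sketch below implements the standard Okapi BM25 term weight, which BM2500 is assumed here to follow. The function names and the default values (k1=1.2, b=0.75, k3=1000) are illustrative assumptions, not the tuned values used in these experiments.

```python
import math


def bm25_term_weight(tf, qtf, df, n_docs, doc_len, avg_doc_len,
                     k1=1.2, b=0.75, k3=1000.0):
    """Okapi BM25 weight of one query term for one document."""
    # Robertson/Sparck Jones inverse document frequency.
    idf = math.log((n_docs - df + 0.5) / (df + 0.5))
    # b controls document length normalization; k1 controls tf saturation.
    K = k1 * ((1.0 - b) + b * doc_len / avg_doc_len)
    doc_part = (k1 + 1.0) * tf / (K + tf)
    # k3 controls saturation of the query term frequency.
    query_part = (k3 + 1.0) * qtf / (k3 + qtf)
    return idf * doc_part * query_part


def bm25_score(query_terms, doc_terms, doc_freqs, n_docs, avg_doc_len, **params):
    """Sum the term weights over the distinct query terms found in the document."""
    score = 0.0
    for term in set(query_terms):
        tf = doc_terms.count(term)
        if tf == 0:
            continue
        score += bm25_term_weight(tf, query_terms.count(term), doc_freqs[term],
                                  n_docs, len(doc_terms), avg_doc_len, **params)
    return score


# Hypothetical numbers: a term in 10 of 1000 documents, average-length document.
w1 = bm25_term_weight(tf=1, qtf=1, df=10, n_docs=1000, doc_len=100, avg_doc_len=100)
w3 = bm25_term_weight(tf=3, qtf=1, df=10, n_docs=1000, doc_len=100, avg_doc_len=100)
```

For block retrieval the same function could be applied with a block in place of a document, using block length and average block length for the normalization.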
Block Retrieval on TREC 2001 and TREC 2002
[Figure: two result plots, panels "TREC 2001 Result" and "TREC 2002 Result".]
Query Expansion on TREC 2001 and TREC 2002
[Figure: two result plots, panels "TREC 2001 Result" and "TREC 2002 Result".]
Block-level Link Analysis
[Figure: link diagram with nodes labeled A, B and C.]
A Sample of User Browsing Behavior
Improving PageRank using Layout Structure
• Z: block-to-page matrix (link structure)
• X: page-to-block matrix (layout structure)
• Block-level PageRank:
  - Compute PageRank on the page-to-page graph
• BlockRank:
  - Compute PageRank on the block-to-block graph
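The formulas for the two graphs did not survive on this slide. One dimensionally consistent reading, in line with the Block-Level Link Analysis paper listed in the references, is that the page-to-page graph is the product XZ and the block-to-block graph is the product ZX. The sketch below works through that reading; the matmul helper and all matrix entries are made-up illustrative values.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# Hypothetical site: 2 pages, 3 blocks. Blocks 0 and 1 lie in page 0, block 2 in page 1.
# X[p][b]: importance of block b within page p (rows sum to 1).
X = [
    [0.7, 0.3, 0.0],
    [0.0, 0.0, 1.0],
]
# Z[b][p]: probability of following a link from block b to page p (rows sum to 1).
Z = [
    [0.0, 1.0],
    [0.5, 0.5],
    [1.0, 0.0],
]

page_graph = matmul(X, Z)   # pages x pages, input to block-level PageRank
block_graph = matmul(Z, X)  # blocks x blocks, input to BlockRank
```

Because X and Z are both row-stochastic in this example, so are their products, which means either graph can be fed directly to a PageRank-style computation.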
Using Block-level PageRank to Improve Search
• Combined score: Search = a * IR_Score + (1 - a) * PageRank
• Block-level PageRank achieves 15-25% improvement over PageRank (SIGIR'04)
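The linear combination above can be illustrated in a few lines. The function names, the candidate tuples, and the assumption that both scores are already normalized to comparable ranges are hypothetical choices for this sketch.

```python
def combined_score(ir_score, pagerank, a):
    """Search = a * IR_Score + (1 - a) * PageRank, with a in [0, 1]."""
    if not 0.0 <= a <= 1.0:
        raise ValueError("a must lie in [0, 1]")
    return a * ir_score + (1.0 - a) * pagerank


def rank(candidates, a):
    """Sort (doc_id, ir_score, pagerank) tuples by combined score, best first."""
    return sorted(candidates,
                  key=lambda c: combined_score(c[1], c[2], a),
                  reverse=True)


# Hypothetical candidates with scores assumed to be normalized to [0, 1].
candidates = [("d1", 0.9, 0.1), ("d2", 0.5, 0.6), ("d3", 0.2, 0.9)]
by_relevance = [c[0] for c in rank(candidates, a=1.0)]
by_importance = [c[0] for c in rank(candidates, a=0.0)]
```

Setting a to 1 ranks purely by relevance and setting it to 0 ranks purely by link-based importance; intermediate values trade the two off.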
Mining Web Images Using Layout & Link Structure (ACMMM'04)
Image Graph Model & Spectral Analysis
• Block-to-block graph
• Block-to-image matrix (container relation): Y
• Image-to-image graph
• ImageRank
  - Compute PageRank on the image graph
• Image clustering
  - Graph partitioning on the image graph
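The formula for the image graph is missing from this slide. A dimensionally consistent construction, assumed here, projects the block-to-block graph onto images through the container matrix: image graph = Y^T (block graph) Y. The helper functions and every matrix entry below are illustrative assumptions.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


# Hypothetical block-to-block graph over 3 blocks.
W_B = [
    [0.0, 0.0, 1.0],
    [0.35, 0.15, 0.5],
    [0.7, 0.3, 0.0],
]
# Y[b][i] = 1 if block b contains image i: block 0 holds image 0, block 2 holds image 1.
Y = [
    [1.0, 0.0],
    [0.0, 0.0],
    [0.0, 1.0],
]

# Image-to-image graph: two images are linked when their containing blocks are linked.
W_I = matmul(matmul(transpose(Y), W_B), Y)
```

ImageRank would then run a PageRank-style computation on W_I, and image clustering would partition the same graph.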
ImageRank
• Relevance Ranking
• Importance Ranking
• Combined Ranking
ImageRank vs. PageRank
• Dataset
  - 26.5 million web pages
  - 11.6 million images
• Query set
  - 45 hot queries in Google image search statistics
• Ground truth
  - Five volunteers were chosen to evaluate the top 100 results returned by the system (iFind)
• Ranking method
ImageRank vs. PageRank
• Image search accuracy using ImageRank and PageRank. Both achieved their best results with the combination parameter set to 0.25.
Example on Image Clustering & Embedding
• 1710 JPG images in 1287 pages were crawled from the website http://www.yahooligans.com/content/animals/
• Six categories: Fish, Mammal, Bird, Amphibian, Reptile, Insect
2-D Embedding of WWW Images
[Figure, two panels: (1) the image graph was constructed from block-level link analysis; (2) the image graph was constructed from traditional page-level link analysis.]
2-D Embedding of Web Images
• 2-D visualization of the mammal category using the second and third eigenvectors
Web Image Search Result Presentation
Figure 1. Top 8 returns of query “pluto” in Google’s image search engine (a) and AltaVista’s image search engine (b)
• Two different topics in the search result
• A possible solution:
  - Cluster search results into different semantic groups
Three Kinds of WWW Image Representation
• Visual Feature Based Representation
  - Traditional CBIR
• Textual Feature Based Representation
  - Surrounding text in image block
• Link Graph Based Representation
  - Image graph embedding
Hierarchical Clustering
• Clustering based on three representations
  - Visual feature
    - Hard to reflect the semantic meaning
  - Textual feature
    - Semantic
    - Sometimes there is too little surrounding text
  - Link graph
    - Semantic
    - Many disconnected sub-graphs (too many clusters)
• Two steps:
  - Use text and link information to get semantic clusters
  - For each cluster, use visual features to re-organize the images to facilitate the user's browsing
Our System
• Dataset
  - 26.5 million web pages
    - http://dir.yahoo.com/Arts/Visual_Arts/Photography/Museums_and_Galleries/
  - 11.6 million images
    - Filter out images whose width-to-height ratio is greater than 5 or smaller than 1/5
    - Remove images whose width and height are both smaller than 60 pixels
• Analyze pages and index images
  - VIPS: Pages → Blocks
  - Surrounding texts used to index images
• An illustrative example
  - Query “Pluto”
  - Top 500 results
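The two image filtering rules above translate directly into a predicate. The function name keep_image and the rejection of non-positive dimensions are assumptions added for this sketch; the thresholds are the ones stated on the slide.

```python
def keep_image(width, height):
    """Apply the two filtering rules to an image's pixel dimensions."""
    if width <= 0 or height <= 0:
        return False  # invalid dimensions (assumption, not from the slide)
    ratio = width / height
    # Drop banner-like or bar-like images with an extreme aspect ratio.
    if ratio > 5 or ratio < 1 / 5:
        return False
    # Drop icons and bullets: both sides smaller than 60 pixels.
    if width < 60 and height < 60:
        return False
    return True


# Hypothetical (width, height) pairs.
sizes = [(600, 100), (500, 100), (50, 50), (50, 200), (100, 600)]
kept = [size for size in sizes if keep_image(*size)]
```

Note that an image is removed for being small only when both sides are under 60 pixels, so a narrow 50 by 200 image still passes.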
Clustering Using Visual Feature
Figure 5. Five clusters of search results of query “pluto” using low-level visual feature. Each row is a cluster.
• From the perspectives of color and texture, the clustering results are quite good. Different clusters have different colors and textures. However, from a semantic perspective, these clusters make little sense.
Clustering Using Textual Feature
Figure 6. The eigengap curve with k for the “pluto” case using textual representation
Figure 7. Six clusters of search results of query “pluto” using textual feature. Each row is a cluster
• Six semantic categories are correctly identified if we choose k = 6.
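Choosing k from an eigengap curve can be sketched as follows. The definition used here, the difference between consecutive eigenvalues sorted in descending order, is one common convention and is an assumption; the slides do not state which definition produced Figure 6, and the eigenvalues below are made up.

```python
def choose_k_by_eigengap(eigenvalues, max_k=None):
    """Return (k, gaps), where k maximizes the gap between the k-th and (k+1)-th eigenvalue."""
    vals = sorted(eigenvalues, reverse=True)
    if len(vals) < 2:
        raise ValueError("need at least two eigenvalues")
    limit = len(vals) - 1 if max_k is None else min(max_k, len(vals) - 1)
    # gaps[k - 1] is the eigengap for k clusters.
    gaps = [vals[k - 1] - vals[k] for k in range(1, limit + 1)]
    best_k = gaps.index(max(gaps)) + 1
    return best_k, gaps


# Hypothetical spectrum of a normalized affinity matrix: six eigenvalues near 1.
spectrum = [1.0, 0.98, 0.97, 0.95, 0.94, 0.93, 0.40, 0.38]
k, gaps = choose_k_by_eigengap(spectrum)
```

A large drop after the sixth eigenvalue suggests six well-separated groups, which is the kind of signal an eigengap curve is read for.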
Clustering Using Graph Based Representation
Figure 8. Five clusters of search results of query “pluto” using image link graph. Each row is a cluster
• Each cluster is semantically aggregated.
• Too many clusters.
• In the “pluto” case, the top 500 results are clustered into 167 clusters. The largest cluster contains 87 images, and 112 clusters contain only one image.
Combining Textual Feature and Link Graph
Figure 9. Six clusters of search results of query “pluto” using combination of textual feature and image link graph. Each row is a cluster
Figure 10. The eigengap curve with k for the “pluto” case using textual and link combination
• Combine two affinity matrices
Final Presentation of Our System
• Use textual and link information to get semantic clusters
• Use low-level visual features to cluster (re-organize) each semantic cluster to facilitate the user's browsing
Summary
• More improvement on web search can be made by mining webpage layout structure
• Leverage visual cues for web information analysis & information extraction
• Demos: http://www.ews.uiuc.edu/~dengcai2
  - Papers
  - VIPS demo & dll
References
• Deng Cai, Shipeng Yu, Ji-Rong Wen and Wei-Ying Ma, “Extracting Content Structure for Web Pages based on Visual Representation”, The Fifth Asia Pacific Web Conference, 2003.
• Deng Cai, Shipeng Yu, Ji-Rong Wen and Wei-Ying Ma, “VIPS: a Vision-based Page Segmentation Algorithm”, Microsoft Technical Report (MSR-TR-2003-79), 2003.
• Shipeng Yu, Deng Cai, Ji-Rong Wen and Wei-Ying Ma, “Improving Pseudo-Relevance Feedback in Web Information Retrieval Using Web Page Segmentation”, 12th International World Wide Web Conference (WWW 2003), May 2003.
• Ruihua Song, Haifeng Liu, Ji-Rong Wen and Wei-Ying Ma, “Learning Block Importance Models for Web Pages”, 13th International World Wide Web Conference (WWW 2004), May 2004.
• Deng Cai, Shipeng Yu, Ji-Rong Wen and Wei-Ying Ma, “Block-based Web Search”, SIGIR 2004, July 2004.
• Deng Cai, Xiaofei He, Ji-Rong Wen and Wei-Ying Ma, “Block-Level Link Analysis”, SIGIR 2004, July 2004.
• Deng Cai, Xiaofei He, Wei-Ying Ma, Ji-Rong Wen and Hong-Jiang Zhang, “Organizing WWW Images Based on The Analysis of Page Layout and Web Link Structure”, The IEEE International Conference on Multimedia and EXPO (ICME'2004), June 2004.
• Deng Cai, Xiaofei He, Zhiwei Li, Wei-Ying Ma and Ji-Rong Wen, “Hierarchical Clustering of WWW Image Search Results Using Visual, Textual and Link Analysis”, 12th ACM International Conference on Multimedia, Oct. 2004.