Course Roadmap
- Introduction to Data mining
  - Introduction, Preprocessing, Clustering, Association Rules, Classification, …
- Web structure mining
  - Authoritative Sources in a Hyperlinked Environment, by Jon M. Kleinberg
  - Efficient Crawling Through URL Ordering, J. Cho, H. Garcia-Molina and L. Page
  - The PageRank Citation Ranking: Bringing Order to the Web
  - The Anatomy of a Large-Scale Hypertextual Web Search Engine, Sergey Brin and Lawrence Page
  - Graph Structure in the Web, Andrei Broder et al.
  - Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery, by Soumen Chakrabarti et al.
  - Trawling the Web for Emerging Cyber-Communities, Ravi Kumar…
  - Building a Cyber-Community Hierarchy Based on Link Analysis, P. Krishna Reddy and Masaru Kitsuregawa
  - Efficient Identification of Web Communities, G. W. Flake, S. Lawrence, et al.
  - Finding Related Pages in WWW, J. Dean and M. R. Henzinger
- Web content mining/Information Retrieval
- Web log mining/Recommendation systems/E-commerce

Efficient Identification of Web Communities
G. W. Flake, S. Lawrence and C. L. Giles, ACM SIGKDD 2000
Instructor: P. Krishna Reddy
E-mail: pkreddy@iiit.net
http://www.iiit.net/~pkreddy

Outline
- Motivation
- Key Contribution
- Relevant Prior Work
- Methodology
- Results

Motivation
- Sheer number of indexable Web pages
  - Results should be valid.
  - All the valuable documents should be indexed.
- Balance between precision and recall
  - High recall and low precision: broad queries to general search engines return thousands of results.
  - Low recall and high precision: organize a subset of the web into a hierarchical structure (the Yahoo approach).

Key contribution
- A web community enables web crawlers to focus effectively on narrow but topically related subsets of the web.
- Proves that a community can be identified in a max-flow/min-cut framework.
- Focused-crawl algorithm to approximate community membership, which increases the precision and recall of search results.

Related prior work
- HITS algorithm
- Graph cuts and partitions
- Maximum flow and minimal cuts

HITS algorithm
- Hubs and authorities
  - An authority is a web page that is given wide endorsement by different authors on the Web.
  - A hub is a page that provides collections of links to authorities.
- Hubs link to many authorities, and authorities are linked to by many hubs.
- Hubs and authorities are very useful for identifying seed (key) sites related to some community.
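
The mutual reinforcement between hubs and authorities can be illustrated with a small iteration. This is a minimal sketch, not the formulation from Kleinberg's paper: the function name `hits`, the fixed iteration count, the L2 normalization, and the toy page names are all assumptions made for illustration.

```python
def hits(links, iterations=50):
    """Return (hub, authority) score dicts.

    links: dict mapping a page to the list of pages it links to.
    """
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages linking in.
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # Hub score: sum of the authority scores of the pages linked to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Normalize both vectors so the scores stay bounded.
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth


# Hypothetical toy graph: three link collections pointing at two topic pages.
toy_links = {
    "hub1": ["svm-a.example", "svm-b.example"],
    "hub2": ["svm-a.example", "svm-b.example"],
    "hub3": ["svm-a.example"],
}
hub_scores, auth_scores = hits(toy_links)
top_authority = max(auth_scores, key=auth_scores.get)
top_hubs = sorted(hub_scores, key=hub_scores.get, reverse=True)[:2]
```

In this toy graph the page with the most in-links from good hubs ends up as the top authority, which is the kind of page one would pick as a seed site.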

Graph cuts and partitions
- Identifying the community is equivalent to partitioning a graph such that the edge weight between the partitions is minimized while maintaining partitions of a minimal size (balanced minimal cut).
- The problem is NP-complete.

Maximum flow and minimal cut
- Definition: Given a directed graph G=(V, E) with edge capacities c(u, v) ∈ Z+, find the maximum flow that can be routed from the source, s, to the sink, t, while obeying all capacity constraints.
- Ford-Fulkerson method
  - Given a flow network G with source s and sink t, this method finds a flow of maximum value from s to t.
- Advantages: computationally tractable; many polynomial-time algorithms exist for solving it.
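
The Ford-Fulkerson idea of repeatedly pushing flow along augmenting paths can be sketched as follows. This is an illustrative sketch only: the function name `max_flow`, the depth-first path search, and the five-edge toy network are assumptions, not material from the slides or the paper.

```python
from collections import defaultdict


def max_flow(capacity, s, t):
    """Ford-Fulkerson with depth-first search for augmenting paths.

    capacity: dict mapping a directed edge (u, v) to a positive integer.
    """
    residual = defaultdict(int)
    neighbors = defaultdict(set)
    for (u, v), c in capacity.items():
        residual[(u, v)] += c
        neighbors[u].add(v)
        neighbors[v].add(u)  # reverse edges live in the residual graph

    def push(u, bottleneck, visited):
        """Try to push up to `bottleneck` units from u to t; return the amount pushed."""
        if u == t:
            return bottleneck
        visited.add(u)
        for v in neighbors[u]:
            if v not in visited and residual[(u, v)] > 0:
                pushed = push(v, min(bottleneck, residual[(u, v)]), visited)
                if pushed > 0:
                    residual[(u, v)] -= pushed
                    residual[(v, u)] += pushed
                    return pushed
        return 0

    total = 0
    while True:
        pushed = push(s, float("inf"), set())
        if pushed == 0:
            return total  # no augmenting path left, so the flow is maximum
        total += pushed


# Hypothetical toy network.
toy_capacity = {
    ("s", "a"): 4, ("s", "b"): 3,
    ("a", "b"): 2, ("a", "t"): 2,
    ("b", "t"): 4,
}
flow_value = max_flow(toy_capacity, "s", "t")
```

For this network the value is 6: the two edges entering t have total capacity 2 + 4, and the flow saturates both.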

Maximum flow Minimum cut
- A "cut" is a collection of edges (lines) whose removal separates the source node from the sink (end) node.
- "Min cut": a minimum cut is a cut (S, T) whose capacity c(S, T) is minimum over all cuts separating s and t.
- The max-flow min-cut theorem of Ford and Fulkerson states that the value of the maximum flow of the network equals the capacity of the minimum cut that separates s and t.
- Applications: VLSI design, routing, scheduling, image segmentation, network reliability testing
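
To make the definition of cut capacity concrete, the sketch below enumerates every cut of a tiny network by brute force and keeps the cheapest one. It is only practical for a handful of vertices and is not how cuts are computed in practice; the function name and the toy network are assumptions for illustration.

```python
from itertools import combinations


def min_cut_brute_force(capacity, s, t):
    """Return (capacity, source_side) of a minimum s-t cut by trying every cut.

    capacity: dict mapping a directed edge (u, v) to its capacity.
    """
    vertices = {x for edge in capacity for x in edge}
    others = sorted(vertices - {s, t})
    best_value, best_side = None, None
    for r in range(len(others) + 1):
        for extra in combinations(others, r):
            side = {s, *extra}  # S contains s; T is everything else, including t
            # Capacity of the cut: total capacity of edges leaving S.
            value = sum(c for (u, v), c in capacity.items()
                        if u in side and v not in side)
            if best_value is None or value < best_value:
                best_value, best_side = value, side
    return best_value, best_side


# Hypothetical toy network.
toy_capacity = {
    ("s", "a"): 4, ("s", "b"): 3,
    ("a", "b"): 2, ("a", "t"): 2,
    ("b", "t"): 4,
}
cut_value, source_side = min_cut_brute_force(toy_capacity, "s", "t")
```

Here the cheapest cut has capacity 6 with S = {s, a, b}. A flow of value 6 also exists (2 units along s→a→t, 2 along s→a→b→t, 2 along s→b→t), so maximum flow and minimum cut agree, as the theorem states.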

Web community (proposed approach)
- Community: A community is a vertex subset C (a subset of V) such that every vertex v ∈ C has at least as many edges connecting to vertices in C as it does to vertices in (V - C).
- Example: A data mining community is a set of sites that link to, or are linked by, more data mining sites than non-data-mining sites.
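
The definition translates directly into a membership check. In this minimal sketch links are treated as undirected (matching "links to or linked by"), which is an assumption of the sketch; the function name `is_community` and the toy graph are hypothetical.

```python
def is_community(edges, C):
    """Check the community condition for vertex set C.

    edges: iterable of (u, v) links, treated as undirected.
    C: candidate set of vertices.
    """
    edges = list(edges)
    C = set(C)
    for v in C:
        inside = outside = 0
        for a, b in edges:
            if v in (a, b):
                other = b if a == v else a
                if other in C:
                    inside += 1
                else:
                    outside += 1
        # v must have at least as many links into C as out of it.
        if inside < outside:
            return False
    return True


# Hypothetical toy graph: a triangle a-b-c attached to x, which also links to y and z.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"),
             ("c", "x"), ("x", "y"), ("x", "z")]
triangle_ok = is_community(toy_edges, {"a", "b", "c"})
with_x_ok = is_community(toy_edges, {"a", "b", "c", "x"})
```

The triangle satisfies the condition, while adding x breaks it, because x has one link into the set and two links out of it.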

Identification of Web community
- Theorem 1: A community, C, can be identified by calculating the s-t minimum cut of G with s and t being used as the source and sink, respectively, provided that both s# and t# exceed the cut set size. After the cut, vertices that are reachable from s are in the community.
- s# refers to the number of edges between s and all vertices in (C - s); t# refers to the number of edges between t and all vertices in (V - C - t).
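
One way to read Theorem 1 operationally: compute a maximum flow from s to t, then collect the vertices still reachable from s in the residual graph. The sketch below does this with breadth-first augmenting paths. The function name, the choice of breadth-first search, and the toy graph with unit capacity in both directions of every link are assumptions of this sketch, not details taken from the slides.

```python
from collections import defaultdict, deque


def community_from_min_cut(capacity, s, t):
    """Return the set of vertices on the source side of a minimum s-t cut."""
    residual = defaultdict(int)
    neighbors = defaultdict(set)
    for (u, v), c in capacity.items():
        residual[(u, v)] += c
        neighbors[u].add(v)
        neighbors[v].add(u)

    def reachable_with_parents():
        """Breadth-first search from s over edges with spare residual capacity."""
        parent = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in neighbors[u]:
                if v not in parent and residual[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        return parent

    while True:
        parent = reachable_with_parents()
        if t not in parent:
            # Flow is maximum; what s can still reach is the community side.
            return set(parent)
        # Walk back from t to s to recover the augmenting path.
        path = []
        v = t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(residual[e] for e in path)
        for u, v in path:
            residual[(u, v)] -= bottleneck
            residual[(v, u)] += bottleneck


# Hypothetical toy web graph: two triangles joined by a single link b-x.
links = [("s", "a"), ("s", "b"), ("a", "b"),
         ("t", "x"), ("t", "y"), ("x", "y"),
         ("b", "x")]
toy_capacity = {}
for u, v in links:
    toy_capacity[(u, v)] = 1  # assumed unit capacity in both directions
    toy_capacity[(v, u)] = 1
community = community_from_min_cut(toy_capacity, "s", "t")
```

In this toy graph the cut set is the single link b-x, and both s and t have two links inside their own side, so the precondition that s# and t# exceed the cut set size holds.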

Communities from links
- Choice of seeds: use HITS to discover a group of authorities and hubs on indexed web pages.
- Choice of sink: use a small collection of web portals as a virtual sink, because they should be very close to the center of the web graph.

Problems, solutions and approximations
- If the entire web is available, identifying a community is a direct application of existing algorithms.
- In this paper a method is proposed to extract a community by considering only a portion of the web.
- Requirement: rapid access to the inbound and outbound links
  - www.archive.org is used to maintain a snapshot of the web.
  - Find outbound links: by downloading the HTML
  - Find inbound links: by querying a search engine
- No true sink is used in the algorithm. Instead, an artificial vertex most distant from the source is used as a sink.

Approximate communities
[Figure: focused community crawling and the induced graph. (a) virtual source; (b) vertices of seed web sites; (c) vertices of web sites one link away from any seed site; (d) references to sites not in (b) or (c); (e) the virtual sink vertex.]

Approximate communities
[Figure labels: source, community, cut set, outside of the community, virtual sink.]

Focused crawl algorithm (graph: G=(V, E); vertices s, t ∈ V)
- While the number of iterations is less than desired, do:
  - Set k equal to the number of vertices in the seed set.
  - Perform maximum flow analysis of G, yielding community C.
  - Identify the non-seed vertex v' ∈ C with the highest in-degree relative to G.
    - Add all nodes whose in-degree equals that of v' to the seed set.
    - Add edge (s, v) to E with infinite capacity for each such node v.
  - Identify the non-seed vertex v' ∈ C with the highest out-degree relative to G.
    - Add all nodes whose out-degree equals that of v' to the seed set.
    - Add edge (s, v) to E with infinite capacity for each such node v.
  - Re-crawl so that G uses all seeds.
  - Let G reflect new information from the crawl.
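
The seed-expansion part of one iteration can be sketched in isolation. This covers only the two "identify and add" steps; the maximum flow analysis and the re-crawl are left out. The function name `expand_seeds`, its arguments, and the toy data are hypothetical, and restricting the added nodes to non-seed members of C is this sketch's reading of the steps above.

```python
def expand_seeds(edges, community, seeds, source="s"):
    """One seed-expansion step of the focused crawl.

    edges: iterable of directed links (u, v) in G.
    community: set of vertices C from the latest maximum flow analysis.
    seeds: current seed set.
    Returns (new_seed_set, new_source_edges), where new_source_edges maps
    (source, v) to infinite capacity for every newly added seed v.
    """
    in_deg, out_deg = {}, {}
    for u, v in edges:
        out_deg[u] = out_deg.get(u, 0) + 1
        in_deg[v] = in_deg.get(v, 0) + 1

    seeds = set(seeds)
    new_source_edges = {}
    # First pass uses in-degree, second pass uses out-degree.
    for degree in (in_deg, out_deg):
        candidates = set(community) - seeds  # non-seed members of C
        if not candidates:
            break
        best = max(degree.get(v, 0) for v in candidates)
        chosen = {v for v in candidates if degree.get(v, 0) == best}
        seeds |= chosen
        for v in chosen:
            new_source_edges[(source, v)] = float("inf")
    return seeds, new_source_edges


# Hypothetical toy data: one seed page and a community found around it.
toy_edges = [("seed", "p"), ("seed", "q"), ("q", "p"),
             ("r", "p"), ("q", "r"), ("q", "x")]
new_seeds, source_edges = expand_seeds(
    toy_edges, community={"seed", "p", "q", "r"}, seeds={"seed"})
```

In the toy data, p has the highest in-degree and q the highest out-degree among non-seed community members, so both become seeds and each gets an infinite-capacity edge from the source.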

Maximum flow problem: the incremental shortest augmentation algorithm
- Intuition
  - Find shortest paths between source and sink.
  - Remove shortest paths until source and sink are separated.

Experimental results
- Support vector machine community
  - Graph size: 11000
  - Community size: 252
  - Results: related to SVM research.
- The Internet Archive community
  - Graph size: 7000
  - Community size: 289
  - Results: closely related to the mission of the IA.
- The "Ronald Rivest" community
  - Graph size: 38000
  - Community size: 150
  - Results: closely related to his research.

Summary and Conclusions
- Defined a new type of web community which can be calculated in a maximum-flow framework.
- A maximum-flow-based web crawler is proposed.
- Discovered communities are very cohesive.
- Runtime is dominated by the network demands of retrieving web pages.
- The flow algorithm is very fast, fast enough for online use.
- Applications
  - Focused crawlers
  - Population of portal categories
  - Improved filtering