Data Mining Algorithms: Web Mining


Web Mining Outline Goal: Examine the use of data mining on the World Wide Web. Topics: Introduction, Web Content Mining, Web Structure Mining, Web Usage Mining. 2


Introduction The Web is perhaps the single largest data source in the world. Web mining aims to extract and mine useful knowledge from the Web. A multidisciplinary field: data mining, machine learning, natural language processing, statistics, databases, information retrieval, multimedia, etc. Due to the heterogeneity and lack of structure of Web data, mining is a challenging task. 3

Opportunities and Challenges The amount of info on the Web is huge and easily accessible. The coverage of Web info is very wide and diverse: info/data of almost all types exist on the Web, e.g. structured tables, texts, multimedia data, etc. Much of the Web's information is semi-structured due to the nested structure of HTML code. Much of the Web's info is linked: hyperlinks among pages within a site, and across different sites. Much of the Web's info is redundant: the same piece of info or its variants may appear in many pages. 4

Opportunities and Challenges The Web is noisy: a Web page typically contains many kinds of info, e.g. main content, advertisements, navigation panels, copyright notices, etc. The Web consists of the surface Web and the deep Web. – Surface Web: pages that can be browsed using a browser. – Deep Web: can only be accessed through parameterized query interfaces. The Web is also about services. The Web is dynamic: information on the Web changes constantly, and keeping up with the changes and monitoring them are important issues. The Web is a virtual society: it is not only about data, information and services, but also about interactions among people, organizations and automatic systems, i.e. communities. 5

Web Mining Other Issues Size – The Indexed Web contains at least 4.16 billion pages (Thursday, 10 October, 2013). The Dutch Indexed Web contains at least 188.34 million pages (Thursday, 10 October, 2013). – Grows at about 1 million pages a day. – Google indexes > 45 billion documents. Diverse types of data, so it is not possible to warehouse the Web or apply normal data mining directly. 6



Web Data Web pages Intra-page structures (HTML, XML code) Inter-page structures (actual linkage structures between web pages) Usage data Supplemental data – Profiles – Registration information – Cookies 8

Web Mining Taxonomy 9


Web Mining Taxonomy Web Content Mining – Extends the work of basic search engines. Web Structure Mining – Mines the structure (links, graph) of the Web. Web Usage Mining – Analyzes logs of Web access. Web mining applications include targeted advertising, recommendation engines, CRM, etc. 10


Web Content Mining Extends work of basic search engines Web content mining: mining, extraction and integration of useful data, information and knowledge from Web page contents Search Engines – IR application, Keyword based, Similarity between query and document – Crawlers, Indexing – Profiles – Link analysis 11


Issues in Web Content Mining Developing intelligent tools for IR – Finding keywords and key phrases – Discovering grammatical rules and collocations – Hypertext classification/categorization – Extracting key phrases from text documents – Learning extraction models/rules – Hierarchical clustering – Predicting (words) relationship 12

Search Engine – Two Rank Functions [Architecture diagram: a Web page parser feeds an indexer, which builds the inverted index, term dictionary (lexicon), forward index and anchor-text data (via an anchor text generator on backward links); a Web graph constructor builds the Web topology graph and URL dictionary. Two rank functions draw on these: relevance ranking (similarity between query and content/text) and importance ranking (link analysis).] 13

How do We Find Similar Web Pages? Content-based approach Structure-based approach Combining both content and structure 14

Relevance Ranking • Inverted index - a data structure for supporting text queries, like the index in a book. [Diagram: a sorted term dictionary (aalborg … armada, armadillo, armani … zz), each entry pointing to its posting list of the document IDs that contain the term, e.g. armada → 4, 19, 29, 98, 143, …]
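The posting-list idea above can be sketched in a few lines of Python. This is a toy illustration with invented documents, not any particular engine's implementation:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def query(index, *terms):
    """AND-query: return documents that contain every term."""
    postings = [set(index.get(t, ())) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

# Hypothetical document collection.
docs = {
    1: "web mining extracts knowledge from the web",
    2: "data mining algorithms",
    3: "the web as a graph",
}
index = build_inverted_index(docs)
```

A query then reduces to intersecting posting lists, e.g. `query(index, "web", "mining")` returns only the documents containing both terms.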


Crawlers Robot (spider) traverses the hypertext structure in the Web. Collect information from visited pages Used to construct indexes for search engines Traditional Crawler – visits entire Web and replaces index Periodic Crawler – visits portions of the Web and updates subset of index Incremental Crawler – selectively searches the Web and incrementally modifies index Focused Crawler – visits pages related to a particular subject 16
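A traditional crawler's traversal amounts to breadth-first search over the link graph. A minimal sketch, with the Web simulated as a dictionary of out-links (all page names are hypothetical):

```python
from collections import deque

def crawl(link_graph, seeds, max_pages=100):
    """Breadth-first traversal of a link graph (dict: URL -> out-links),
    as a traditional crawler would traverse the live Web."""
    frontier = deque(seeds)
    visited = []
    seen = set(seeds)
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)          # a real crawler would fetch and index here
        for link in link_graph.get(url, ()):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

# Toy link graph standing in for the Web.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a", "d"], "d": []}
```

A periodic or incremental crawler differs mainly in which pages it revisits and how it updates the index, not in this basic traversal.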


Focused Crawler Only visit links from a page if that page is determined to be relevant. Classifier is static after learning phase. Components: – Classifier which assigns relevance score to each page based on crawl topic. – Distiller to identify hub pages. – Crawler visits pages based on crawler and distiller scores. 17

Focused Crawler The classifier relates documents to topics. The classifier also determines how useful outgoing links are. Hub pages contain links to many relevant pages; they must be visited even if they do not have a high relevance score. 18

Focused Crawler 19


Virtual Web View Multiple Layered Database (MLDB) built on top of the Web. Each layer of the database is more generalized (and smaller) and centralized than the one beneath it. Upper layers of the MLDB are structured and can be accessed with SQL-type queries. Translation tools convert Web documents to XML. Extraction tools extract desired information to place in the first layer of the MLDB. Higher levels contain more summarized data obtained through generalizations of the lower levels. 20

Multilevel Databases Text Image Audio Video Maps Games


Levels of A MLDB Layer 0: – Unstructured, massive and global information base. Layer 1: – Derived from lower layers. – Relatively structured. – Obtained by data analysis, transformation & generalization. Higher layers (Layer n): – Further generalization to form smaller, better-structured databases for more efficient retrieval.

Web Query Systems These systems attempt to make use of: – a standard database query language (SQL) – structural information about Web documents – natural language processing for queries made in WWW searches. Examples: – WebLog: restructuring extracted information from Web sources. – W3QL: combines structure queries (organization of hypertext) and content queries (information retrieval techniques).

Architecture of a Global MLDB [Diagram: sources 1…n feed a resource discovery process that builds the MLDB; generalized data and a concept hierarchy produce the higher levels, which support knowledge discovery.]


Personalization Web access or contents tuned to better fit the desires of each user. Manual techniques identify user’s preferences based on profiles or demographics. Collaborative filtering identifies preferences based on ratings from similar users. Content based filtering retrieves pages based on similarity between pages and user profiles. 25

Applications ShopBot Bookmark Organizer Recommender Systems Intelligent Search Engines 26


Document Classification Supervised Learning – Supervised learning is a ‘machine learning’ technique for creating a function from training data. – Documents are categorized – The output can predict a class label of the input object (called classification). Techniques used are – Nearest Neighbor Classifier – Feature Selection – Decision Tree


Feature Selection Removes terms in the training documents which are statistically uncorrelated with the class labels Simple heuristics – Stop words like “a”, “an”, “the” etc. – Empirically chosen thresholds for ignoring “too frequent” or “too rare” terms – Discard “too frequent” and “too rare terms”
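The stop-word and frequency-threshold heuristics above can be sketched as follows; the thresholds, stop-word list and sample documents are illustrative assumptions, not fixed values:

```python
def select_features(docs, stop_words, min_df=2, max_df_ratio=0.8):
    """Keep terms that are neither stop words, too rare (appear in fewer
    than min_df documents), nor too frequent (appear in more than
    max_df_ratio of all documents)."""
    n = len(docs)
    df = {}                                   # document frequency per term
    for text in docs:
        for term in set(text.lower().split()):
            df[term] = df.get(term, 0) + 1
    return sorted(t for t, c in df.items()
                  if t not in stop_words and c >= min_df and c / n <= max_df_ratio)

# Hypothetical training documents.
docs = ["the web is huge", "web mining is fun",
        "the data is big", "data mining on the web"]
kept = select_features(docs, {"the", "is", "on"})
```

Terms such as "huge" or "fun" fall below the rarity threshold, and the stop words are removed outright; only the topical terms survive.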

Document Clustering Unsupervised learning: a data set of input objects is gathered. Goal: evolve measures of similarity to cluster a collection of documents/terms into groups, so that similarity within a cluster is larger than across clusters. Hypothesis: given a 'suitable' clustering of a collection, if the user is interested in document/term d/t, he is likely to be interested in other members of the cluster to which d/t belongs. Hierarchical – Bottom-Up – Top-Down Partitional


Semi-Supervised Learning A collection of documents is available A subset of the collection has known labels Goal: to label the rest of the collection. Approach – Train a supervised learner using the labeled subset. – Apply the trained learner on the remaining documents. Idea – Harness information in the labeled subset to enable better learning. – Also, check the collection for emergence of new topics

Web-Structure Mining Generate a structural summary of a Web site and its pages: • Depending upon the hyperlinks, categorizing Web pages and the related information at the inter-domain level. • Discovering the Web page structure. • Discovering the nature of the hierarchy of hyperlinks in the Website and its structure.

Web-Structure Mining cont… Finding information about Web pages Retrieving information about the relevance and quality of a Web page. Finding authoritative pages on the topic and content. Inference on hyperlinks A Web page contains not only information but also hyperlinks, which carry a huge amount of annotation. A hyperlink signals the author's endorsement of the other Web page.

Web Structure Mining Mine the structure (links, graph) of the Web Techniques – PageRank – CLEVER Create a model of the Web organization. May be combined with content mining to more effectively retrieve important pages. 33

Web as a Graph Web pages as nodes of a graph. Links as directed edges. [Diagram: example graph with nodes "my page", www.vesit.edu and www.google.com joined by directed edges.] 34


Link Structure of the Web Forward links (out-edges). Backward links (in-edges). Approximation of importance/quality: a page may be of high quality if it is referred to by many other pages, and by pages of high quality. 35

Authorities and Hubs Authority is a page which has relevant information about the topic. Hub is a page which has a collection of links to pages about that topic. [Diagram: a hub page h pointing to authorities a1–a4.] 36

PageRank Introduced by Brin and Page (1998). Mines the hyperlink structure of the Web to produce a 'global' importance ranking of every Web page. Used in the Google search engine; Web search results are returned in rank order. Treats a link like an academic citation. Assumption: highly linked pages are more 'important' than pages with few links. 37

PageRank Used by Google Prioritizes pages returned from a search by looking at Web structure. The importance of a page is calculated based on the number of pages which point to it – backlinks. Weighting is used to give more importance to backlinks coming from important pages. 38

PageRank: Main Idea A page has a high rank if the sum of the ranks of its back-links is high. Google utilizes a number of factors to rank the search results: – proximity, anchor text, page rank. The benefits of PageRank are greatest for underspecified queries; for example, a 'Mumbai University' query using PageRank lists the university home page first. 39


Basic Idea Back-links coming from important pages convey more importance to a page. For example, if a web page has a link from the yahoo home page, it may be just one link but it is a very important one. A page has high rank if the sum of the ranks of its back-links is high. This covers both the case when a page has many back-links and when a page has a few highly ranked back-links. 40

Definition A page's rank is equal to the sum of the ranks of all the pages pointing to it. 41

Simplified PageRank Rank(u) = c * sum over v in B_u of Rank(v)/N_v, where B_u is the set of pages pointing to u, N_v is the number of outgoing links of v, and c is a normalization constant (c < 1 to cover for pages with no outgoing links). 42

Expanded Definition R(u) = c * sum over v in B_u of R(v)/N_v + c * E(u), where: R(u): page rank of page u; c: factor used for normalization (< 1); B_u: set of pages pointing to u; N_v: number of outbound links of v; R(v): page rank of page v that points to u; E(u): distribution of Web pages to which a random surfer periodically jumps (set to 0.15). 43
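Putting the expanded definition together, a power-iteration sketch in Python. This uses the common damping formulation with d = 0.85 and a uniform jump vector (i.e. the slide's E(u) = 0.15 spread over all pages); the three-page graph is invented:

```python
def pagerank(links, d=0.85, iters=50):
    """Power iteration for R(u) = (1 - d)/N + d * sum(R(v)/N_v) over the
    pages v in B_u (the pages linking to u); `links` maps page -> out-links."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - d) / n for p in pages}
        for v, outs in links.items():
            if not outs:                        # dangling page: share rank evenly
                for p in pages:
                    new[p] += d * rank[v] / n
            else:
                for u in outs:
                    new[u] += d * rank[v] / len(outs)
        rank = new
    return rank

# Toy three-page graph: a -> b, c;  b -> c;  c -> a.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
```

Here page c ends up ranked highest: it receives b's full rank plus half of a's, illustrating how back-links from well-ranked pages convey importance.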

Problem 1 - Rank Sink A cycle of pages pointed to by some incoming link: the loop will accumulate rank but never distribute it. 44

Problem 2 - Dangling Links In general, many Web pages do not have either back links or forward links. Dangling links do not affect the ranking of any other page directly, so they are removed until all the PageRanks are calculated. 45

PageRank (cont'd) PR(p) = c (PR(1)/N_1 + … + PR(n)/N_n) – PR(i): PageRank of a page i which points to target page p. – N_i: number of links coming out of page i. 46

HITS Hyperlink-Induced Topic Search Based on a set of keywords, find a set of relevant pages, R. Identify hub and authority pages for these. – Expand R to a base set, B, of pages linked to or from R. – Calculate weights for authorities and hubs. Pages with the highest ranks in R are returned. 47

Authorities and Hubs Authority is a page which has relevant information about the topic. Hub is a page which has a collection of links to pages about that topic. [Diagram: a hub page h pointing to authorities a1–a4.] 48

Authorities and Hubs (cont.) Good hubs are the ones that point to good authorities. Good authorities are the ones that are pointed to by good hubs. [Diagram: hubs h1–h5 pointing to authorities a1–a4.] 49


Finding Authorities and Hubs First, construct a focused sub-graph of the www. Second, compute Hubs and Authorities from the sub-graph. 50

Construction of Sub-graph [Diagram: a topic query to a search engine produces the root set of pages; a crawler expands it with forward-link pages and pages linking into the root set, giving the expanded set.] 51

Root Set and Base Set Use the query term to collect a root set of pages from a text-based search engine (Lycos, Altavista). 52


Root Set and Base Set (cont. ) Expand root set into base set by including (up to a designated size cut-off): – All pages linked to by pages in root set – All pages that link to a page in root set 53

Hubs & Authorities Calculation Iterative algorithm on the base set: authority weights a(p), and hub weights h(p). – Set authority weights a(p) = 1 and hub weights h(p) = 1 for all p. – Repeat the following two operations (and then re-normalize a and h to have unit norm): a(p) = sum of h(v) over all pages v pointing to p; h(p) = sum of a(v) over all pages v that p points to. 54
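The two update operations above can be sketched directly. This is a toy example with two hub pages and two authority pages (all names invented); normalization uses the Euclidean norm as stated:

```python
def hits(links, iters=50):
    """HITS updates: a(p) = sum of h(v) over pages v linking to p,
    h(p) = sum of a(v) over pages v that p links to; both vectors are
    re-normalized to unit Euclidean norm after each round."""
    pages = set(links) | {u for outs in links.values() for u in outs}
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iters):
        auth = {p: sum(hub[v] for v in links if p in links.get(v, ()))
                for p in pages}
        hub = {p: sum(auth[u] for u in links.get(p, ())) for p in pages}
        for vec in (auth, hub):
            norm = sum(x * x for x in vec.values()) ** 0.5 or 1.0
            for p in vec:
                vec[p] /= norm
    return auth, hub

# Two hub pages each linking to the same two authority pages.
graph = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "a1": [], "a2": []}
auth, hub = hits(graph)
```

As expected, the pages that are linked to get high authority scores and the pages doing the linking get high hub scores.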

Example [Diagram: a small graph in which every node starts with hub weight 0.45 and authority weight 0.45.] 55

Example (cont.) [Diagram: the same graph after one iteration; the hub and authority weights are updated to values such as 0.45, 0.9 and 1.35, 0.9.] 56

Algorithmic Outcome Iterated multiplication (power iteration) computes the principal eigenvector starting from any 'non-degenerate' initial vector. Hubs and authorities emerge as the outcome of this process: the principal eigenvector contains the highest hubs and authorities. 57

Results Although HITS is only link-based (it completely disregards page content), the results are quite good in many tested queries. From a narrow topic, HITS tends to drift to a more general one. A particular issue with hub pages: many links can cause the algorithm to drift, since they can point to authorities in different topics. Pages from a single domain/website can dominate the result if they point to one page – not necessarily a good authority. 58

Possible Enhancements Use weighted sums for the link calculation. Take advantage of 'anchor text' – the text surrounding the link itself. Break hubs into smaller pieces; analyze each piece separately instead of the whole hub page as one. Disregard or minimize the influence of links within one domain. IBM expanded HITS into CLEVER; it is not seen as a viable real-time search engine. 59

CLEVER Method – CLient-side EigenVector-Enhanced Retrieval – Developed by a team of researchers at the IBM Almaden Research Centre – Ranks pages primarily by measuring the links between them – A continued refinement of HITS (Hypertext Induced Topic Selection) – Basic principles: Authorities, Hubs – Good hubs point to good authorities – Good authorities are referenced by good hubs http://www.almaden.ibm.com/projects/clever.shtml

Problems Prior to CLEVER Ignoring textual content leads to problems caused by some features of the Web: – HITS returns good resources for a more general topic when query topics are narrowly focused – HITS occasionally drifts when hubs discuss multiple topics – Pages from a single Web site often take over a topic, and often use the same HTML template, therefore pointing to a single popular site irrelevant to the query topic http://www.almaden.ibm.com/projects/clever.shtml

CLEVER: Solution Extension 1: Anchor Text – use the text that surrounds hyperlink definitions (hrefs) in Web pages, often referred to as 'anchor text' – boost the weights of links that occur near instances of query terms Extension 2: Mini Hubs/Pagelets – break a large hub into smaller units – treat contiguous subsets of links as mini-hubs or 'pagelets' – contiguous sets of links on a hub page are more focused on a single topic than the entire page http://www.almaden.ibm.com/projects/clever.shtml

CLEVER: The Process – Starts by collecting a set of pages – Gathers all pages of the initial link set, plus any pages linking to them – Ranks the result by counting links – Links are noisy, and it is not clear which pages are best – Recalculates the scores: pages with the most links are established as most important, and their links transmit more weight – Repeats the calculation a number of times until the scores are refined http://www.almaden.ibm.com/projects/clever.shtml


CLEVER Identify authoritative and hub pages. Authoritative Pages : – Highly important pages. – Best source for requested information. Hub Pages : – Contain links to highly important pages. 64

CLEVER The CLEVER algorithm is an extension of standard HITS and provides a solution to the problems that result from standard HITS. CLEVER assigns a weight to each link based on the terms of the query and the end-points of the link, combining anchor text to set the link weights. Moreover, it breaks large hub pages into smaller units so that each hub unit is focused on a single topic. Finally, when a large number of pages come from a single domain, it scales down their weights to reduce the probability of inflated weights. 65

PageRank vs. HITS PageRank (Google) – computed for all Web pages stored in the database prior to the query – computes authorities only – trivial and fast to compute HITS (CLEVER) – performed on the set of retrieved Web pages for each query – computes authorities and hubs – easy to compute, but real-time execution is hard 66

Web Usage Mining Performs mining on Web usage data, or Web logs. A Web log is a listing of page reference data, also called a click stream. It can be seen from either the server perspective – better Web site design – or the client perspective – prefetching of Web pages, etc. 67


Web Usage Mining Applications Personalization Improve structure of a site’s Web pages Aid in caching and prediction of future page references Improve design of individual pages Improve effectiveness of e-commerce (sales and advertising) Improve web server performance (Load Balancing) 68

Web Usage Mining Activities Preprocessing the Web log – cleanse, remove extraneous information, sessionize – Session: sequence of pages referenced by one user at a sitting. Pattern Discovery – count patterns that occur in sessions – a pattern is a sequence of page references in a session – similar to association rules: transaction ~ session, itemset ~ pattern (or subset), but order is important. Pattern Analysis 69
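Sessionizing a log as described - splitting a user's clicks whenever the gap between requests exceeds a timeout - can be sketched as follows. The 30-minute timeout and the sample records are illustrative assumptions:

```python
def sessionize(log, timeout=1800):
    """Split each user's (user, timestamp, page) records into sessions:
    a new session starts when the gap since the user's previous request
    exceeds `timeout` seconds (30 minutes by default)."""
    sessions = {}
    last_seen = {}
    for user, ts, page in sorted(log, key=lambda r: (r[0], r[1])):
        if user not in last_seen or ts - last_seen[user] > timeout:
            sessions.setdefault(user, []).append([])   # open a new session
        sessions[user][-1].append(page)
        last_seen[user] = ts
    return sessions

# Hypothetical click records: (user, seconds since some epoch, page).
log = [("u1", 0, "/"), ("u1", 60, "/products"), ("u1", 4000, "/"),
       ("u2", 10, "/about")]
sessions = sessionize(log)
```

User u1's third request arrives more than 30 minutes after the second, so it opens a second session; the resulting page sequences are what pattern discovery then counts.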

Web Usage Mining Issues Identification of the exact user is not possible. The exact sequence of pages referenced by a user is not recoverable due to caching. Sessions are not well defined. Security, privacy, and legal issues. 70


Web Usage Mining - Outcome Association rules – Find pages that are often viewed together Clustering – Cluster users based on browsing patterns – Cluster pages based on content Classification – Relate user attributes to patterns 71

Web Log Cleansing Replace the source IP address with a unique but non-identifying ID. Replace the exact URL of referenced pages with a unique but non-identifying ID. Delete error records and records that do not contain page data (such as figures and code). 72


Data Structures Keep track of patterns identified during Web usage mining process Common techniques: – Trie – Suffix Tree – Generalized Suffix Tree – WAP Tree 73

Web Usage Mining – Three Phases http://www.acm.org/sigs/sigkdd/explorations/issue1-2/srivastava.pdf

Phase 1: Pre-processing Converts the raw data into the data abstraction necessary for applying the data mining algorithm, by either – mapping the log data into relational tables before an adapted data mining technique is performed, or – using the log data directly, with special pre-processing techniques. 75

Raw data – Web log Click stream: a sequential series of page view requests. User session: a delimited set of user clicks (click stream) across one or more Web servers. Server session (visit): a collection of user clicks to a single Web server during a user session. Episode: a subset of related user clicks that occur within a user session. 76

Phase 2: Pattern Discovery Uses techniques such as statistical analysis, association rules, clustering, classification, sequential patterns and dependency modeling. 77

Phase 3: Pattern Analysis A process to gain knowledge about how visitors use a Website in order to – prevent disorientation and help designers place important information/functions exactly where visitors look for them, in the way users need it – build an adaptive Website server. 78


Techniques for Web usage mining Construct a multidimensional view on the Weblog database – perform multidimensional OLAP analysis to find the top N users, top N accessed Web pages, most frequently accessed time periods, etc. Perform data mining on Weblog records – find association patterns, sequential patterns, and trends of Web accessing – may need additional information, e.g. user browsing sequences of the Web pages in the Web server buffer. Conduct studies to – analyze system performance, and improve system design by Web caching, Web page prefetching, and Web page swapping.

Software for Web Usage Mining WEBMINER: – introduces a general architecture for Web usage mining, automatically discovering association rules and sequential patterns from server access logs. – proposes an SQL-like query mechanism for querying the discovered knowledge in the form of association rules and sequential patterns. WebLogMiner – the Web log is filtered to generate a relational database – data mining on the Web log data cube and Web log database

WEBMINER SQL-like query A framework for Web mining – Association rules, using the Apriori algorithm: 40% of clients who accessed the Web page with URL /company/products/product1.html also accessed /company/products/product2.html – Sequential patterns: 60% of clients who placed an online order in /company/products/product1.html also placed an online order in /company/products/product4.html within 15 days
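The support counting behind such association rules can be sketched as a first Apriori-style pass over sessions; the page names and support threshold are invented for illustration:

```python
from itertools import combinations
from collections import Counter

def frequent_pairs(sessions, min_support=0.4):
    """Count page pairs co-occurring in sessions and keep those whose
    support (fraction of sessions containing both pages) meets the
    threshold - the first pass of an Apriori-style algorithm."""
    n = len(sessions)
    counts = Counter()
    for session in sessions:
        for pair in combinations(sorted(set(session)), 2):
            counts[pair] += 1
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

# Hypothetical sessionized click streams.
sessions = [["/p1", "/p2"], ["/p1", "/p2", "/p3"], ["/p1", "/p3"],
            ["/p2"], ["/p1", "/p2"]]
rules = frequent_pairs(sessions)
```

From frequent pairs like (/p1, /p2) with support 0.6, rules of the WEBMINER form ("X% of clients who accessed page A also accessed page B") follow by dividing by the support of the antecedent page.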

WebLogMiner Database construction from the server log file: – data cleaning – data transformation Multi-dimensional Web log data cube construction and manipulation Data mining on the Web log data cube and Web log database

Mining the World-Wide Web Design of a Web Log Miner – the Web log is filtered to generate a relational database – a data cube is generated from the database – OLAP is used to drill-down and roll-up in the cube – OLAM is used for mining interesting knowledge [Diagram: Web log → (1) data cleaning → database → (2) data cube creation → data cube → (3) OLAP → sliced and diced cube → (4) mining → knowledge]