Web Structure Mining by A Surasit Samaisut Copyrights

Introduction
• The advent of the World-Wide Web has overwhelmed users, from typical home computer users to companies that do business over the web, with an enormous flood of information
• To cope with the abundance of available information, users of the WWW need to rely on intelligent tools that assist them in finding, sorting, and filtering it

Mining Data on the Web
[Figure: "Mining to Knowledge" diagram with the labels Raw Data, Overloaded Data, Patterns/Information, Selected Information, and Knowledge]

Data and Web Mining
• Data mining aims at discovering valuable information that is hidden in conventional databases
• Web mining aims at finding and extracting relevant information that is hidden in web-related data, in particular in text documents that are published on the web

Web Mining Review
• Like data mining, web mining is a multidisciplinary effort that draws techniques from fields such as information retrieval, statistics, machine learning, and natural language processing
• Web mining is a new research area that tries to apply techniques from data mining and machine learning to web data

Web Mining Review
• Depending on the nature of the data, one can distinguish three main areas of research within the web mining community
Web Mining
• Web Content Mining
  • Text
  • Image
  • Audio
  • Video
  • Structured Records
• Web Structure Mining
  • Hyperlinks
  • Document Structure
• Web Usage Mining
  • Web Server Logs
  • Application Level Logs
  • Application Server Logs

Web Mining Review
• Web Content Mining
  • Application of data mining techniques to unstructured or semi-structured data
• Web Structure Mining
  • Use of the hyperlink structure of the web as an (additional) information source
• Web Usage Mining
  • Analysis of user interactions with a web server (e.g. click-stream analysis)

The Objectives
• To generate a structural summary of the web site and its web pages
• To discover the link structure of the hyperlinks at the inter-document level

Hyperlinks
• Conventional information retrieval focuses primarily on information that is provided by the text of web documents
• The web provides additional information through the way in which different documents are connected to each other via “Hyperlinks”

Hyperlinks
• A hyperlink is a connection from an element (word, phrase, image) in a document to somewhere else in the same document, or to a different destination on the web
• It is usually shown in a different color or underlined to set it apart from the rest of the document, and is activated by a mouse click

Hyperlink Example
[Figure: example page with the labels Visited link, Link, Texts, and Mouse click link]

Basic HTML Hyperlinks
• Text Link
  <a href="http://www.surasit.com/">This is a Link</a>
• Image Link
  <a href="http://www.surasit.com/"><img src="URL" alt="Link"></a>

Basic HTML Hyperlinks
• E-Mail Link
  <a href="mailto:info@surasit.com">Send e-mail</a>
• Link within webpage
  A named anchor: <a name="top"></a>
  <a href="#top">Jump to the top</a>
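Before any link structure can be mined, the hyperlinks have to be pulled out of the HTML. The following is a minimal sketch of that step, assuming Python's standard `html.parser`; the class name `LinkExtractor`, the helper `extract_links`, and the three kind labels are hypothetical names chosen for illustration, not part of the slides.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect a (kind, target) pair for every <a href=...> tag in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href is None:
            return  # a named anchor such as <a name="top"> has no target
        if href.startswith("mailto:"):
            kind = "email"       # e-mail link
        elif href.startswith("#"):
            kind = "fragment"    # link within the same web page
        else:
            kind = "document"    # link to another document
        # Resolve relative targets against the URL of the page itself.
        self.links.append((kind, urljoin(self.base_url, href)))


def extract_links(html, base_url):
    """Return all hyperlinks found in the given HTML text."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

Only the "document" links connect one page to another, so those are the ones that feed the inter-document link analysis discussed in the rest of the deck.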

Link Structure
• The link structure of the web has been important since the creation of the first crawlers, such as the World Wide Web Worm
• Link structure analysis came into existence as a result of two algorithms
  • Page Rank
  • HITS

Link Structure – Page Rank
• Page Rank (PR) is an algorithm that assigns a global importance rating to web pages based on link structure alone
• Principle: “More important web pages will be more frequently linked to”

Search Engine – Page Rank
• Although the search returns several million pages, the most relevant pages are usually found within the top ten or twenty pages in the list of results

Search Engine – Page Rank
• How does the search engine know which pages are the most important?
• The search engine assigns a number to each individual page, expressing its importance
• This number is known as Page Rank
• It is sometimes called the PR number

Page Rank – PR
• Page Rank is computed by “Power Iteration”
• Google webmaster tool bar
• *Additional reading provided
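The power iteration named above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the search engine's actual implementation: the damping factor of 0.85, the even redistribution of rank from pages without out-links, and the function name `pagerank` are all choices made here and do not come from the slides.

```python
def pagerank(out_links, damping=0.85, tol=1e-8, max_iter=200):
    """Power iteration over a link graph.

    out_links maps each page to the list of pages it links to.
    The damping value and dangling-page handling are assumptions.
    """
    pages = sorted(set(out_links) | {q for ts in out_links.values() for q in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform distribution
    for _ in range(max_iter):
        # Rank held by pages without out-links is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not out_links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        # Every page passes its rank on in equal shares to the pages it links to.
        for p in pages:
            targets = out_links.get(p, [])
            for q in targets:
                new_rank[q] += damping * rank[p] / len(targets)
        change = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if change < tol:  # stop once the scores no longer move
            break
    return rank
```

Because the scores depend only on the link graph, they can be computed once off-line and reused for every query.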

Page Rank – Advantages
• Fighting spam
  • A page is important if the pages pointing to it are important
  • Since it is not easy for a web page owner to obtain in-links to his/her page from other important pages, it is not easy to influence Page Rank

Page Rank – Advantages
• Page Rank is a standard global measurement approach
• Page Rank is query independent
• Page Rank values of all the pages are computed and saved off-line rather than at query time

Page Rank – Criticism
• Query independence
  • It cannot distinguish between pages that are authoritative in general and pages that are authoritative on the query topic

Link Structure – HITS
• HITS stands for Hypertext Induced Topic Search
• Unlike Page Rank, which is a static ranking algorithm, HITS is search-query dependent

HITS
• When the user issues a search query, HITS first expands the list of relevant pages returned by a search engine
• It then produces two rankings of the expanded set of pages
  • Authority Ranking
  • Hub Ranking
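The expansion step can be sketched as follows. The sketch assumes the usual HITS construction, in which the search results (the root set) are grown by adding the pages they link to and the pages that link to them; the function name `expand_root_set`, the two dictionary arguments, and the `max_in_links` cap are hypothetical names and values chosen for illustration.

```python
def expand_root_set(root_set, out_links, in_links, max_in_links=50):
    """Grow the pages returned by the search engine into the expanded set.

    out_links[p] lists the pages p points to; in_links[p] lists the pages
    pointing to p. The cap on in-links per page is an assumed parameter.
    """
    expanded = set(root_set)
    for page in root_set:
        # Add every page that this result points to.
        expanded.update(out_links.get(page, []))
        # Add pages pointing to this result, capped to keep the set small.
        expanded.update(in_links.get(page, [])[:max_in_links])
    return expanded
```

The authority and hub rankings are then computed only over the links among the pages in the returned set.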

HITS – Authority Ranking
• An authority is a page with many in-links
• The idea is that the page may have good or authoritative content on some topic
• Thus many people trust it and link to it

HITS – Hub Ranking
• A hub is a page with many out-links
• The page serves as an organizer of the information on a particular topic
• It points to many good authority pages on the topic
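As a first approximation of these two definitions, in-links and out-links can simply be counted per page. This is only a rough sketch, not the HITS ranking itself, and the function name `link_degrees` is a hypothetical choice.

```python
from collections import Counter


def link_degrees(edges):
    """Count in-links and out-links per page.

    edges is an iterable of (source, target) hyperlinks.
    """
    edges = set(edges)  # count each distinct link only once
    in_degree = Counter(target for _, target in edges)
    out_degree = Counter(source for source, _ in edges)
    return in_degree, out_degree
```

Pages with a high in-degree are authority candidates and pages with a high out-degree are hub candidates; raw counts treat every link as equally valuable, regardless of the quality of the page it comes from or points to.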

Authority vs. Hub Page
[Figure: an authority page and a hub page]

The Key Idea of HITS
• A good hub points to many good authorities
• A good authority is pointed to by many good hubs
• Authorities and hubs have a mutual reinforcement relationship

The Key Idea of HITS
[Figure: a densely linked set of authorities and hubs, with the two groups labeled Authorities and Hubs]

The Key Idea of HITS
• HITS assigns every page an authority score and a hub score
• The authority scores and hub scores are computed the same way as the Page Rank scores, using Power Iteration
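The mutual reinforcement between hubs and authorities can be sketched as a power iteration that updates the two scores in turn. The fixed iteration count, the Euclidean normalisation, and the function name `hits` are assumptions made for this illustration.

```python
import math


def hits(edges, iterations=50):
    """Compute authority and hub scores for a set of (source, target) links."""
    edges = set(edges)
    pages = {p for edge in edges for p in edge}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A good authority is pointed to by many good hubs.
        auth = {p: 0.0 for p in pages}
        for src, dst in edges:
            auth[dst] += hub[src]
        # A good hub points to many good authorities.
        hub = {p: 0.0 for p in pages}
        for src, dst in edges:
            hub[src] += auth[dst]
        # Rescale both score vectors to unit length so they stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return auth, hub
```

Sorting the pages by the first returned dictionary gives the authority ranking, and sorting by the second gives the hub ranking.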

HITS – Advantages
• It ranks pages according to the query topic
• It is therefore able to provide more relevant authority and hub pages

HITS – Disadvantages
• It is easily spammed
  • It is easy to influence HITS, since adding out-links to one’s own page is so easy
• Topic drift
  • Many pages in the expanded set may not be on topic
• Inefficiency at query time
  • Query-time evaluation is slow; collecting the root set, expanding it, and performing the computation are all expensive operations

Summary
• Based on the topology of hyperlinks, web structure mining categorizes web pages and generates information about them
• For instance, the similarity and relationship between a number of different websites

Web Structure Mining
• Another task of web structure mining is to discover the nature of the hyperlink network within the websites of a particular domain
• This helps to generalize the flow of information in websites
• As a result, query processing becomes easier and more efficient

Web Structure Mining
• If a web page links directly to another web page, the relationship between the two pages may fall into one of these types
  • Related by synonyms
  • Have similar contents
  • Sit on the same web server

Problem of Web links
• Link counts are often inflated because the web is unregulated
• There is nothing to stop authors from creating millions of pages that link to wherever they choose

Data Cleansing – Solution
• Data cleansing is mainly concerned with eliminating duplicate or similar links
• One data cleansing technique is Alternative Document Models (ADMs)

ADMs
• The implication is that duplicate links from the same document can be ignored, even if the source and/or target URLs are not identical
• To remove duplicate and similar pages
• To avoid overloading web servers
• To detect spam (page flooding)

Example of ADMs
• One author may place an entire book online in a single huge page
• Whereas another could split a similar one into thousands of individual pages
• ADMs are used so that links replicated across such web pages count as one unique link
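The effect of the book example on link counts can be sketched by grouping URLs into larger "documents" before counting. This is a rough illustration under stated assumptions: the three levels (page, directory, domain), the function names `document_key` and `count_links`, and the `example.org` and `example.com` URLs in the usage note are hypothetical and do not come from the slides.

```python
from urllib.parse import urlsplit


def document_key(url, level="domain"):
    """Map a URL to the 'document' it belongs to under the chosen model."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if level == "page":
        return host + parts.path                         # every page is a document
    if level == "directory":
        return host + parts.path.rsplit("/", 1)[0] + "/"  # pages in one folder
    if level == "domain":
        return host                                       # all pages on one host
    raise ValueError(f"unknown level: {level}")


def count_links(links, level="domain"):
    """Count each (source document, target document) pair only once."""
    pairs = set()
    for source, target in links:
        pairs.add((document_key(source, level), document_key(target, level)))
    return len(pairs)
```

For a book split into 1000 pages such as `http://book.example.org/ch0/page.html`, each linking to `http://target.example.com/index.html`, counting at the page level gives 1000 links, while counting at the domain level gives one, the same as for the single-page version of the book.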

Relation with Web content mining
• Web Content Mining
  • Mainly focuses on the inner-document structure
• Web Structure Mining
  • Tries to discover the link structure of the hyperlinks at the inter-document level

Relation with Web content mining
• Web structure mining has a natural relation with web content mining, since web documents are very likely to contain links, and both use the real or primary data on the web
• The two mining tasks are quite often combined in one application
• Web content contained in a hyperlinked community is both accurate and relevant

Web Structure Mining Applications
• Web personalization
• Understanding user behavior
• Improving the web site structure and content