Web Mining
By Pawan Singh, Piyush Arora, Pooja Mansharamani, Pramod Singh, Praveen Kumar

Outline
• Introduction
• Web Mining
• Web Content Mining
• Web Structure Mining
• Web Usage Mining
• Conclusion & Exam Questions

Four Problems
• Finding relevant information
  o Low precision, which is due to the irrelevance of many of the search results. This results in difficulty finding the relevant information.
  o Low recall, which is due to the inability to index all the information available on the web. This results in difficulty finding unindexed information that is relevant.
• Creating new knowledge out of the information available on the web
  o While the problem above is a query-triggered process (retrieval oriented), this problem is a data-triggered process.
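Precision and recall, as used above, can be made concrete with a short sketch. It assumes the standard IR definitions (precision = relevant results returned / all results returned; recall = relevant results returned / all relevant documents). The function name and the page identifiers are hypothetical, chosen only for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: documents the search engine returned
    relevant:  documents that are actually relevant (ground truth)
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were returned
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 pages returned, 5 relevant pages exist on the web,
# but only 2 of the relevant pages were indexed and returned.
retrieved = ["p1", "p2", "p3", "p4"]
relevant = ["p2", "p4", "p7", "p8", "p9"]
precision, recall = precision_recall(retrieved, relevant)
```

In this made-up case half of the returned pages are irrelevant (low precision), and three relevant pages were never returned (low recall).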

Four Problems (continued)
• Personalizing the information
  o Catering to personal preferences in content and presentation (associated with the type and presentation of the information)
• Learning about the consumers
  o What does the customer want to do?
  o Using web data to effectively market products and/or services

Other Approaches
Web mining is NOT the only approach.
• Database approach (DB)
• Information retrieval (IR)
• Natural language processing (NLP)
  o In-depth syntactic and semantic analysis
• Web document community
  o Standards, manually appended meta-information, maintained directories, etc.

Direct vs. Indirect Web Mining
• Web mining techniques can be used to solve the information overload problems:
  o Directly: attack the problem with web mining techniques. E.g. a newsgroup agent classifies news as relevant.
  o Indirectly: used as part of a bigger application that addresses the problems. E.g. used to create index terms for a web search service.

The Research
• Converging research from: databases, information retrieval, and artificial intelligence (specifically NLP and machine learning)
• Focusing on research from the machine learning point of view

Web Mining: Definition
• “Web mining refers to the overall process of discovering potentially useful and previously unknown information or knowledge from the Web data.”
  o Can be viewed as four subtasks
  o Not the same as Information Retrieval
  o Not the same as Information Extraction

Web Mining: Subtasks
• Resource finding
  o Retrieving intended documents
• Information selection/pre-processing
  o Select and pre-process specific information from retrieved web resources
• Generalization
  o Discover general patterns within and across web sites
• Analysis
  o Validation and/or interpretation of mined patterns

Web Mining: Not IR
• Information retrieval (IR) is the automatic retrieval of all relevant documents while at the same time retrieving as few of the non-relevant documents as possible
• Web document classification, which is a Web Mining task, could be part of an IR system (e.g. indexing for a search engine)

Web Mining: Not IE
• Information extraction (IE) aims to extract the relevant facts from given documents, while IR aims to select relevant documents.
  o IE systems for the general Web are not feasible
  o Most focus on specific Web sites or content

IE - IR
Information Retrieval
• Automatic retrieval of relevant documents
• Primary goals:
  o Indexing text
  o Searching for useful documents in a collection
  o Views a document as a “bag of unordered words”
  o The “Web document classification” task is an instance of IR
Information Extraction
• Extract relevant facts from documents
• Primary goals:
  o Transform a collection of retrieved documents into information
  o Interested in the structure or representation of a document
  o IE has a higher level of granularity
  o Result:
    o Structured database
    o Compression or summary of text or documents
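The IR side of this comparison (indexing text, searching a collection, “bag of unordered words”) can be sketched in a few lines. This is a minimal illustration, not a production search engine: the tokenizer, the function names, and the three sample documents are assumptions made for the example.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Represent a text as unordered word counts (word order is discarded)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def build_inverted_index(docs):
    """Indexing step: map each term to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in bag_of_words(text):
            index.setdefault(term, set()).add(doc_id)
    return index


def search(index, query):
    """Searching step: return documents that contain every query term."""
    terms = list(bag_of_words(query))
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical three-document collection
docs = {
    "d1": "Web mining discovers knowledge from Web data",
    "d2": "Information retrieval finds relevant documents",
    "d3": "Mining usage data from Web server logs",
}
index = build_inverted_index(docs)
matches = search(index, "web data")
```

Note that the result is a set of whole documents. Pulling individual facts out of those documents is the IE task described in the second half of the slide.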

Types of IE
• IE from unstructured texts (classical)
  o Unstructured: free texts, e.g. news stories
  o Basic to deep linguistic preprocessing
• IE from semi-structured texts (structural)
  o Semi-structured: HTML
  o Uses meta-information, e.g. HTML tags
  o Wrapper induction, machine learning used to build systems (semi-)automatically
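To illustrate IE from semi-structured text, here is a small hand-written wrapper that uses HTML tags as meta-information to turn a table into structured records. It is only a sketch: a wrapper induction system would learn such extraction rules (semi-)automatically, whereas here they are coded by hand. The class name, function name, and sample page are hypothetical; the parser is `html.parser.HTMLParser` from the Python standard library.

```python
from html.parser import HTMLParser


class TableWrapper(HTMLParser):
    """Hand-written wrapper: collects the cell text of every table row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows, each a list of cell strings
        self._row = None   # row currently being read, or None
        self._cell = None  # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract_records(html):
    """Return one dict per data row, keyed by the header row's cell text."""
    parser = TableWrapper()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Hypothetical semi-structured page
PAGE = """
<table>
  <tr><th>title</th><th>price</th></tr>
  <tr><td>Data Mining</td><td>40</td></tr>
  <tr><td>Web Search</td><td>35</td></tr>
</table>
"""
records = extract_records(PAGE)
```

The output is a list of records that could be loaded into a structured database, which is the result IE aims for.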

Web Mining and Machine Learning
• Machine learning is concerned with the development of algorithms and techniques that allow computers to "learn".
• Web mining is NOT learning from the Web.
• Some applications of machine learning on the web are NOT Web Mining.
• Methods used for Web Mining are NOT limited to machine learning.
• There is a close relationship between web mining and machine learning.

Web Mining and Machine Learning (continued)
• Machine learning techniques support web mining because they can be applied to the processes within web mining.
• For example, recent research shows that applying machine learning techniques could improve the text classification process compared to traditional IR techniques.
• In short, web mining intersects with the application of machine learning on the web.
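As one illustration of machine learning applied to text classification, the sketch below implements a small multinomial Naive Bayes classifier with add-one smoothing. The choice of Naive Bayes is an assumption made for this example; the slides do not name a specific algorithm. The class name, training texts, and labels are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # word counts per class
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in tokenize(text):
                if word in self.vocab:  # words never seen in training are skipped
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training pages and labels
texts = [
    "cheap flights and hotel deals",
    "hotel booking discount flights",
    "football match score tonight",
    "league match and football transfer news",
]
labels = ["travel", "travel", "sports", "sports"]
model = NaiveBayes().fit(texts, labels)
prediction = model.predict("football score")
```

The classifier learns its word weights from labeled examples rather than relying on hand-tuned term matching, which is the kind of contrast with traditional IR techniques that the slide describes.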

Web Mining Categories
• Web Content Mining
  o Discovering useful information from web contents/data/documents
• Web Structure Mining
  o Discovering the model underlying link structures (topology) on the Web, e.g. discovering authorities and hubs
• Web Usage Mining
  o Make sense of data generated by surfers
  o Usage data from logs, user profiles, user sessions, cookies, user queries, bookmarks, mouse clicks and scrolls, etc.
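The “authorities and hubs” example under Web Structure Mining can be illustrated with a simplified version of the HITS iteration: a page's authority score is the sum of the hub scores of pages linking to it, and its hub score is the sum of the authority scores of pages it links to. This is a sketch under those assumptions; the function name, the fixed iteration count, and the tiny link graph are hypothetical.

```python
def hits(links, iterations=20):
    """Compute hub and authority scores for a link graph.

    links: dict mapping each page to the list of pages it links to.
    Returns (hub, authority) dicts, each normalized to unit length.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum the hub scores of pages that link here.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # Hub update: sum the authority scores of pages this page links to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth


# Hypothetical link graph
links = {
    "portal": ["news", "shop", "blog"],
    "directory": ["news", "shop"],
    "blog": ["news"],
}
hub, auth = hits(links)
top_hub = max(hub, key=hub.get)
top_authority = max(auth, key=auth.get)
```

In this made-up graph the page with the most outgoing links to well-cited pages becomes the top hub, and the most-linked-to page becomes the top authority.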

Web Content Data Structure
• Unstructured: free text
• Semi-structured: HTML
• More structured: table or database-generated HTML pages
• Multimedia data: receives less attention than text or hypertext

Web Structure Mining
• Interested in the structure between Web documents (not within a document)
• Example: PageRank (Google)
• Applications:
  o Discovering micro-communities in the Web
  o Measuring the “completeness” of a Web site
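The PageRank example can be sketched as a simple power iteration over the link graph. This is a textbook-style simplification, not Google's production algorithm: the damping factor of 0.85, the fixed iteration count, the handling of pages without outgoing links, and the four-page graph are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration.

    links: dict mapping each page to the list of pages it links to.
    Returns a dict of page -> rank; the ranks sum to 1.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share (random jump).
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Page with no outgoing links: spread its rank over all pages.
                share = damping * rank[p] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical link graph between four documents
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
rank = pagerank(links)
top_page = max(rank, key=rank.get)
```

Only the links between documents are used; the content within each document plays no role, which matches the first bullet of the slide.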

Web Usage Mining
• Tries to predict user behavior from interaction with the Web
• Wide range of data (logs)
  o Web client data
  o Proxy server data
  o Web server data
• Two common approaches
  o Map usage data into relational tables before using adapted data mining techniques
  o Use log data directly by utilizing special pre-processing techniques
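The first approach (mapping usage data into relational tables) might look like the following sketch. It assumes web server log lines in the Common Log Format and uses an in-memory SQLite table from the Python standard library. The table name, column names, regular expression, and sample log lines are hypothetical.

```python
import re
import sqlite3

# Assumed Common Log Format:
# host ident user [time] "method path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3})'
)

# Hypothetical web server log
SAMPLE_LOG = """\
10.0.0.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326
10.0.0.1 - - [10/Oct/2023:13:56:01 +0000] "GET /products.html HTTP/1.1" 200 5120
10.0.0.2 - - [10/Oct/2023:13:57:12 +0000] "GET /index.html HTTP/1.1" 200 2326
10.0.0.2 - - [10/Oct/2023:13:58:40 +0000] "GET /missing.html HTTP/1.1" 404 512
"""


def load_log(lines):
    """Parse log lines into a relational table; unparseable lines are skipped."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE hits "
        "(host TEXT, time TEXT, method TEXT, path TEXT, status INTEGER)"
    )
    for line in lines:
        m = LOG_PATTERN.match(line)
        if m:
            conn.execute(
                "INSERT INTO hits VALUES (?, ?, ?, ?, ?)",
                (m["host"], m["time"], m["method"], m["path"], int(m["status"])),
            )
    return conn


conn = load_log(SAMPLE_LOG.splitlines())

# Once the data is relational, ordinary queries can summarize usage,
# e.g. how often each page was served successfully.
page_counts = conn.execute(
    "SELECT path, COUNT(*) FROM hits WHERE status = 200 "
    "GROUP BY path ORDER BY COUNT(*) DESC, path"
).fetchall()
```

With the log in a table, adapted data mining techniques (for example, finding pages that are frequently visited together) can be run on top of standard SQL queries.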

Thank you!