Overview of Web Data Mining and Applications, Part II
Bamshad Mobasher, DePaul University
What is Web Mining?
Definition: the application of data mining and machine learning techniques to extract useful knowledge from the content, structure, and usage of Web resources.
Types of Web Mining
• Web Content Mining
• Web Usage Mining
• Web Structure Mining
Web Usage Mining: extracting interesting patterns from user interactions with resources on one or more Web sites.
Applications of Web Usage Mining:
• user and customer behavior modeling
• Web site optimization
• e-customer relationship management
• Web marketing
• targeted advertising
• personalization
Data Mining and Personalization
• Personalization: the “killer app” for big data analytics
• Tangible successes in both research and industrial applications
  - recommender systems
  - personalized Web agents
  - user adaptive systems
  - Web marketing and targeted advertising
  - personalized search
• Sophisticated modeling approaches based on both predictive and unsupervised data mining techniques
Web Usage Mining: Data Sources
• Typical sources of data:
  - automatically generated Web/application server access logs
  - e-commerce and product-oriented user events (e.g., shopping cart changes, product clickthroughs)
  - user profiles and/or user ratings
  - meta-data, page content, site structure
• User transactions:
  - sets or sequences of pageviews, possibly with associated weights
  - a pageview is a set of page files and associated objects that contribute to a single display in a Web browser
What’s in a Typical Server Log?
Typical Fields in a Log File Entry
Field              Example value
client IP address  1.2.3.4
base URL           maya.cs.depaul.edu
date/time          2006-02-01 00:08:43
HTTP method        GET
file accessed      /classes/cs589/papers.html
protocol version   HTTP/1.1
status code        200 (successful access)
bytes transferred  9221
referrer page      http://dataminingresources.blogspot.com/
user agent         Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+2.0.50727)
In addition, there may be fields corresponding to
• login information
• client-side cookies (unique keys issued to clients in order to identify a repeat visitor)
• session IDs issued by the Web or application servers
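As a rough illustration of the fields listed above, the sketch below parses the example entry into a typed record. The space-separated field order, the `LogEntry` class, and the `parse_entry` function are assumptions made for this example; real servers use configurable log formats, so the split would need to match the actual configuration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class LogEntry:
    """One log record; field names are illustrative, not a standard."""
    client_ip: str
    host: str
    timestamp: datetime
    method: str
    path: str
    protocol: str
    status: int
    bytes_sent: int
    referrer: str
    user_agent: str


def parse_entry(line: str) -> LogEntry:
    """Parse one space-separated line in the assumed field order."""
    parts = line.split()
    if len(parts) != 11:
        raise ValueError(f"expected 11 fields, got {len(parts)}")
    ip, host, day, clock, method, path, protocol, status, size, ref, agent = parts
    return LogEntry(
        client_ip=ip,
        host=host,
        timestamp=datetime.strptime(f"{day} {clock}", "%Y-%m-%d %H:%M:%S"),
        method=method,
        path=path,
        protocol=protocol,
        status=int(status),
        bytes_sent=int(size),
        referrer=ref,
        user_agent=agent,
    )


# The example entry from the slide, joined into a single line.
SAMPLE = (
    "1.2.3.4 maya.cs.depaul.edu 2006-02-01 00:08:43 GET "
    "/classes/cs589/papers.html HTTP/1.1 200 9221 "
    "http://dataminingresources.blogspot.com/ "
    "Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+2.0.50727)"
)
entry = parse_entry(SAMPLE)
```

Splitting on whitespace works here only because the user agent encodes spaces as `+`; a format that quotes fields containing spaces would need a different tokenizer.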
Basic Entities in Web Usage Mining
• User (Visitor): a single individual accessing files from one or more Web servers through a browser
• Page File: a file that is served through the HTTP protocol
• Pageview: the set of page files that contribute to a single display in a Web browser
• User Session: the set of pageviews served due to a series of HTTP requests from a single user across the entire Web
• Server Session: the set of pageviews served due to a series of HTTP requests from a single user to a single site
• Transaction (Episode): a subset of pageviews from a single user session or server session
Main Challenges in Data Collection and Preprocessing
Main questions:
• what data to collect and how to collect it; what to exclude
• how to identify the requests associated with a unique user session (HTTP is “stateless”)
• how to identify/define user transactions
• how to identify the basic unit of analysis (e.g., pageviews, items purchased, user ratings)
• how to integrate data across channels: e-commerce data, clickstream data, user profiles, social media data, product meta-data, etc.
Usage Data Preparation Tasks
• Data cleaning
  - remove irrelevant references and fields in server logs
  - remove references due to spider navigation
  - add missing references due to client-side caching
• Data integration
  - synchronize data from multiple server logs
  - integrate e-commerce and application server data
  - integrate meta-data
• Data transformation
  - pageview identification
  - identification of product-oriented events
  - identification of unique users
  - sessionization: partitioning each user’s record into multiple sessions or transactions (usually representing different visits)
  - integrating meta-data and user profile data with user sessions
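The sessionization step above can be sketched with a simple time-gap heuristic: a new session starts whenever a user is idle for longer than a threshold. The slides do not name a method, so the heuristic itself, the 30-minute value, and the `sessionize` function are assumptions chosen for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta

TIMEOUT = timedelta(minutes=30)  # assumed threshold, not taken from the slides


def sessionize(requests, timeout=TIMEOUT):
    """Split each user's requests into sessions by idle time.

    `requests` is an iterable of (user_id, timestamp, url) tuples.
    Returns a list of (user_id, [url, ...]) pairs, one per session.
    """
    by_user = defaultdict(list)
    for user, ts, url in requests:
        by_user[user].append((ts, url))

    sessions = []
    for user, hits in by_user.items():
        hits.sort(key=lambda h: h[0])  # order each user's requests by time
        current, last_ts = [], None
        for ts, url in hits:
            # A gap longer than the timeout closes the current session.
            if last_ts is not None and ts - last_ts > timeout:
                sessions.append((user, current))
                current = []
            current.append(url)
            last_ts = ts
        if current:
            sessions.append((user, current))
    return sessions


# Hypothetical, already-cleaned requests with users already identified.
t0 = datetime(2006, 2, 1, 0, 0, 0)
requests = [
    ("u1", t0, "/home"),
    ("u1", t0 + timedelta(minutes=5), "/products"),
    ("u1", t0 + timedelta(hours=2), "/home"),
    ("u2", t0 + timedelta(minutes=1), "/home"),
]
sessions = sessionize(requests)
```

This assumes user identification has already been done; in practice that step (IP plus user agent, cookies, or logins) often dominates the quality of the resulting sessions.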
Conceptual Representation of User Transactions or Sessions
[Figure: matrix of sessions/user transactions by pageviews/objects]
This is the typical representation of the data, after preprocessing, that is used as input to data mining algorithms. Raw weights may be binary, based on time spent on a page, or based on other measures of user interest in an item. In practice, this data needs to be normalized or standardized.
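A minimal sketch of this representation, assuming sessions are rows, pageviews are columns, and the raw weight is time spent in seconds. The page names, the helper names, and the choice of row-sum normalization are all illustrative; other schemes (binary weights, z-scores) fit the same structure.

```python
def build_matrix(sessions, pageviews):
    """Rows are sessions, columns are pageviews; each cell is a raw weight."""
    return [[s.get(p, 0.0) for p in pageviews] for s in sessions]


def normalize_rows(matrix):
    """Scale each row so its weights sum to 1 (all-zero rows are kept as is)."""
    out = []
    for row in matrix:
        total = sum(row)
        out.append([w / total for w in row] if total else list(row))
    return out


pageviews = ["A.html", "B.html", "C.html"]
# Hypothetical sessions: seconds spent on each pageview.
sessions = [
    {"A.html": 15, "B.html": 5},
    {"B.html": 30, "C.html": 30},
]
matrix = build_matrix(sessions, pageviews)
normalized = normalize_rows(matrix)
```

Row normalization makes sessions of different lengths comparable, which matters for distance-based methods such as clustering.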
Web Usage Mining as a Process
E-Commerce Data
• Integrating e-commerce and usage data
  - needed for analyzing relationships between the navigational patterns of visitors and business questions such as profitability, customer value, and product placement
  - E-business / Web analytics
  - e.g., tracking and analyzing the conversion of browsers to buyers
• E-commerce vs. simple usage data
  - e-commerce data is product oriented, while usage data is pageview oriented
  - usage events (pageviews) are well defined and have a consistent meaning across all Web sites
  - e-commerce events are often applicable only to specific domains, and the definition of certain events can vary from site to site
  - the major difficulty for usage events is getting accurate preprocessed data
  - the major difficulty for e-commerce events is defining and implementing the events for a particular site
Why We Need Web Analytics
• Are we attracting new people to our site?
• Is our site “sticky”? Which regions in it are not?
• What is the health of our lead qualification process?
• How adept is our conversion of browsers to buyers?
• What behavior indicates purchase propensity?
• What site navigation do we wish to encourage?
• How can profiling help us cross-sell and up-sell?
• How do customer segments differ?
• What attributes describe our best customers?
• Can we target other prospects like them?
• What makes customers loyal?
• How do we measure loyalty?
Three Skill Sets Required
• Technology (data collection / preprocessing / integration)
  - How do we get the data? Are we collecting the right data?
• Analytics (analysis tools, OLAP, data mining)
  - How do we turn the data into insightful information?
• Business Management (e-metrics)
  - What action do we take? How do we measure the impact of that action?
Using Analytics for E-Business Management
• Navigation Calibration
• Calculating Content
  - Popularity
  - Freshness: refresh rate / visit frequency (< 1?)
  - Stickiness / Slipperiness / Leakage
  - Stimulus - Inducement
• Conversion Quotient
• Interaction Computation
• Customer Service Assessment
• Customer Experience Evaluation
• Branding
Web Usage and E-Business Analytics
Different levels of analysis:
• Session Analysis
• Static Aggregation and Statistics
• OLAP
• Data Mining
Session Analysis
• Simplest form of analysis: examine individual server sessions, or groups of them, together with e-commerce data.
• Advantages:
  - Gain insight into typical customer behaviors.
  - Trace specific problems with the site.
• Drawbacks:
  - LOTS of data.
  - Difficult to generalize.
Static Aggregation (Reports)
• Most common form of analysis.
• Data is aggregated by predetermined units such as days or sessions.
• Generally gives the most “bang for the buck.”
• Advantages:
  - Gives a quick overview of how a site is being used.
  - Minimal disk space or processing power required.
• Drawbacks:
  - No ability to “dig deeper” into the data.
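Aggregation by predetermined units can be as simple as a few counters over preprocessed records. The sketch below is illustrative only; the `(session_id, day, url)` record layout and the sample values are assumptions.

```python
from collections import Counter
from datetime import date

# Hypothetical preprocessed records: (session_id, day, url).
records = [
    ("s1", date(2006, 2, 1), "/home"),
    ("s1", date(2006, 2, 1), "/products"),
    ("s2", date(2006, 2, 1), "/home"),
    ("s3", date(2006, 2, 2), "/home"),
]

# Pageviews per day: one count per record.
pageviews_per_day = Counter(day for _, day, _ in records)

# Sessions per day: deduplicate (session, day) pairs before counting.
sessions_per_day = Counter(
    day for _, day in {(sid, day) for sid, day, _ in records}
)

# Hits per URL across the whole period.
hits_per_url = Counter(url for _, _, url in records)
```

Because the grouping units are fixed in advance, a report like this is cheap to compute but cannot answer questions along dimensions that were not chosen up front.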
Online Analytical Processing (OLAP)
• Allows changes to the aggregation level for multiple dimensions.
• Generally associated with a data warehouse.
• Advantages and drawbacks:
  - Very flexible.
  - Requires significantly more resources than static reporting.
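The idea of changing the aggregation level across dimensions can be sketched without any OLAP engine. The fact rows, the dimension names (`day`, `region`, `section`), and the `roll_up` helper below are hypothetical; a real deployment would typically query a data warehouse instead.

```python
from collections import defaultdict

# Hypothetical fact rows: three dimensions and one measure.
facts = [
    {"day": "2006-02-01", "region": "US", "section": "products", "views": 3},
    {"day": "2006-02-01", "region": "EU", "section": "products", "views": 2},
    {"day": "2006-02-02", "region": "US", "section": "support", "views": 4},
    {"day": "2006-02-02", "region": "US", "section": "products", "views": 1},
]


def roll_up(rows, dims, measure="views"):
    """Sum `measure` grouped by the chosen dimensions.

    Passing fewer dimensions aggregates to a coarser level;
    passing more drills down to a finer one.
    """
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[d] for d in dims)] += row[measure]
    return dict(totals)


by_day = roll_up(facts, ["day"])
by_region_section = roll_up(facts, ["region", "section"])
grand_total = roll_up(facts, [])
```

The same function serves every aggregation level, which is the flexibility the slide describes; the cost is that the detailed fact rows must be kept available.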
Data Mining: Going Deeper
• Frequent Itemsets and Association Rules
  - The “Donkey Kong Video Game” and “Stainless Steel Flatware Set” product pages are accessed together in 1.2% of the sessions.
  - When the “Shopping Cart Page” is accessed in a session, the “Home Page” is also accessed 90% of the time.
  - When the “Stainless Steel Flatware Set” product page is accessed in a session, the “Donkey Kong Video” page is also accessed 5% of the time.
  - 30% of clients who accessed /special-offer.html placed an online order in /products/software/.
• Sequential Patterns
  - Add an extra dimension, time, to frequent itemsets and association rules (“x% of the time, when AB appears in a transaction, C appears within z transactions”).
  - 40% of people who bought the book “How to cheat IRS” booked a flight to South America 6 months later.
  - The “Video Game Caddy” pageview is accessed after the “Donkey Kong Video Game” pageview 50% of the time. This occurs in 1% of the sessions.
  - 15% of visitors followed the path home > * > software > * > shopping cart > checkout.
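The percentages in the association-rule examples above are support ("accessed together in x% of the sessions") and confidence ("when A is accessed, B is also accessed y% of the time"). A minimal sketch of both measures follows; the session data and page names are made up, and each session is reduced to the set of pages it touched.

```python
def support(sessions, items):
    """Fraction of sessions that contain every page in `items`."""
    items = set(items)
    return sum(items <= s for s in sessions) / len(sessions)


def confidence(sessions, antecedent, consequent):
    """Among sessions containing `antecedent`, the fraction that also
    contain `consequent`. Assumes the antecedent occurs at least once."""
    both = set(antecedent) | set(consequent)
    return support(sessions, both) / support(sessions, antecedent)


# Hypothetical sessions, each a set of accessed pages.
sessions = [
    {"home", "cart"},
    {"home", "flatware"},
    {"home", "cart", "flatware"},
    {"flatware"},
]

together = support(sessions, {"home", "cart"})
cart_to_home = confidence(sessions, {"cart"}, {"home"})
flatware_to_home = confidence(sessions, {"flatware"}, {"home"})
```

Sequential patterns need ordered sessions rather than sets, so they are outside what this sketch covers.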
Data Mining: Going Deeper
• Clustering: Content-Based or Usage-Based
  - Customer/visitor segmentation
  - Categorization of pages and products
• Classification
  - Classifying users into behavioral groups (browser, likely to purchase, loyal customer, etc.)
  - Examples:
    - Customers who access Video Game product pages, have an income of 50K+, and have 1 or more children should get a banner ad for Xbox on their next visit.
    - Customers who make at least 4 purchases in one year should be categorized as “loyal.”
    - Loan applicants in the 45K-60K income range, with low debt and good-to-excellent credit, should be approved for a new mortgage.
Example: Path Analysis for Ecommerce
Visit
• Search: 10% of visits (64% successful); avg sale per visit: 2.2X
  - Last search succeeded: 70%; avg sale per visit: 2.8X
  - Last search failed: 30%; avg sale per visit: 0.9X
• No search: 90% of visits; avg sale per visit: $X
Example: Association Analysis for Ecommerce
Product                   Association             Lift  Confidence  Website Recommended Products
Fully Reversible Mats     Egyptian Cotton Towels  456   41%         J Jasper Towels (confidence 1.4%)
White Cotton T-Shirt Bra  Plunge T-Shirt Bra      246   25%         Black embroidered underwired bra (confidence 1%)
• Confidence: 41% of those who purchased Fully Reversible Mats also purchased Egyptian Cotton Towels.
• Lift: people who purchased Fully Reversible Mats were 456 times more likely to purchase Egyptian Cotton Towels than the general population.
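Lift compares the confidence of a rule with the baseline rate at which the associated product is bought by everyone. The sketch below computes it from raw counts; the counts are invented for illustration and are not the data behind the table above.

```python
def lift(n_total, n_a, n_b, n_both):
    """Lift of the rule A -> B from raw purchase counts.

    confidence = n_both / n_a   (share of A buyers who also bought B)
    baseline   = n_b / n_total  (share of all customers who bought B)
    lift       = confidence / baseline
    """
    confidence = n_both / n_a
    baseline = n_b / n_total
    return confidence / baseline


# Hypothetical counts: 100,000 customers, 200 bought A, 500 bought B,
# and 50 bought both.
example_lift = lift(n_total=100_000, n_a=200, n_b=500, n_both=50)
independent = lift(n_total=1_000, n_a=100, n_b=100, n_both=10)
```

A lift near 1 means the two purchases look independent; very large values, like those in the table, usually arise when the associated product is rarely bought overall.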
Web Usage Mining: Clustering Example
• Transaction clusters: cluster similar user transactions and use the centroid of each cluster as a usage profile (a representative for a user segment).
Sample cluster centroid from a department Web site (cluster size = 330):
Support  URL                                                       Pageview Description
1.00     /courses/syllabus.asp?course=45096-303&q=3&y=2002&id=290  SE 450 Object-Oriented Development class syllabus
0.97     /people/facultyinfo.asp?id=290                            Web page of a lecturer who taught the above course
0.88     /programs/                                                Current Degree Descriptions 2002
0.85     /programs/courses.asp?depcode=96&deptmne=se&courseid=450  SE 450 course description in SE program
0.82     /programs/2002/gradds2002.asp                             M.S. in Distributed Systems program description
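A cluster centroid of this kind can be sketched as the mean weight of each pageview over the sessions in the cluster; with binary weights, that mean is the fraction of sessions containing the page, matching the "Support" column above. The sessions, the shortened URLs, and the 0.5 cutoff below are assumptions for illustration.

```python
def centroid(cluster, pageviews):
    """Mean weight of each pageview across the sessions in a cluster."""
    n = len(cluster)
    return {p: sum(s.get(p, 0) for s in cluster) / n for p in pageviews}


def usage_profile(cluster, pageviews, min_support=0.5):
    """Pageviews whose mean weight reaches `min_support`, highest first."""
    c = centroid(cluster, pageviews)
    return sorted(((w, p) for p, w in c.items() if w >= min_support), reverse=True)


pageviews = ["/courses/syllabus", "/people/faculty", "/programs/", "/news/"]
# Hypothetical cluster of four sessions with binary weights.
cluster = [
    {"/courses/syllabus": 1, "/people/faculty": 1, "/programs/": 1},
    {"/courses/syllabus": 1, "/people/faculty": 1},
    {"/courses/syllabus": 1, "/people/faculty": 1, "/programs/": 1, "/news/": 1},
    {"/courses/syllabus": 1},
]
profile = usage_profile(cluster, pageviews)
```

Dropping low-weight pages keeps the profile short enough to interpret as a description of the user segment.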
Basic Framework for E-Commerce Data Analysis
[Figure: architecture diagram. Components shown: Web/Application Server Logs; Site Content Analysis Module; Data Cleaning / Sessionization Module; Data Integration Module; Integrated Sessionized Data; Site Map; Site Dictionary; Operational Database (customers, orders, products); E-Commerce Data Mart; Data Cube; OLAP Tools; Data Mining Engine; Usage Analysis; OLAP Analysis; Pattern Analysis.]