Web Usage Mining Part 2: Web Usage Mining Applications
Clustering of Web users
The data are the web-access log of a first-year computing science course.
Students come from a wide variety of backgrounds.
Students’ attitudes toward the course also vary a great deal.
The profile of visits will reflect these distinctions between students.
Hypothesize the visitor categories as:
Workers: working on class or lab assignments or accessing the discussion board.
Studious: downloading the current set of notes on a regular basis to study.
Crammers: downloading a large set of notes for pretest cramming.
Data Preparation
Track a visit by studying the time interval between requests from the same IP address.
A visit starts when the first request is made from an IP address.
The visit continues as long as consecutive requests from the IP address have a sufficiently small delay.
Assume that a delay longer than 10 minutes means a new visit.
Even if it is the same user who comes back after staying away for 10 minutes, he or she is most likely coming back to the website with a new objective.
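The visit rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `split_into_visits`, the `(ip, timestamp_in_seconds)` input shape, and the sample requests are hypothetical and do not come from the course log.

```python
from collections import defaultdict

VISIT_TIMEOUT_SECONDS = 10 * 60  # a delay longer than 10 minutes starts a new visit


def split_into_visits(requests, timeout=VISIT_TIMEOUT_SECONDS):
    """Group (ip, timestamp_in_seconds) requests into visits per IP address."""
    by_ip = defaultdict(list)
    for ip, ts in requests:
        by_ip[ip].append(ts)

    visits = []  # each visit is (ip, [timestamps of its requests])
    for ip, stamps in by_ip.items():
        stamps.sort()
        current = [stamps[0]]
        for ts in stamps[1:]:
            if ts - current[-1] > timeout:
                # The gap is too long: close the current visit and start a new one.
                visits.append((ip, current))
                current = [ts]
            else:
                current.append(ts)
        visits.append((ip, current))
    return visits


# Hypothetical sample: two requests close together, one 15 minutes later,
# and a single request from a second address.
sample = [("10.0.0.1", 0), ("10.0.0.1", 120), ("10.0.0.1", 1020), ("10.0.0.2", 50)]
visits = split_into_visits(sample)
```

With this sample the first address yields two visits, because the third request arrives 900 seconds after the second.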
Representation of a Visitor
1. On-campus/off-campus access.
2. Daytime/nighttime access: 8 a.m. to 8 p.m. was considered to be the daytime.
3. Access during lab/class days or non-lab/class days: All the labs and classes were held on Tuesday and Thursday. The visitors on these days were more likely to be workers.
4. Number of hits.
5. Number of class notes downloaded.
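A minimal sketch of how one visit could be turned into this five-attribute record. The function name, argument names, and the example dates are hypothetical, and treating 8 p.m. itself as nighttime is an assumption, since the slide does not say which side the boundary falls on.

```python
from datetime import datetime


def visit_features(on_campus, start, hits, notes_downloaded):
    """Return [on_campus, daytime, lab_day, hits, notes] for one visit."""
    # Assumption: daytime covers 08:00 up to, but not including, 20:00.
    daytime = 1 if 8 <= start.hour < 20 else 0
    # datetime.weekday(): Monday is 0, so Tuesday is 1 and Thursday is 3.
    lab_day = 1 if start.weekday() in (1, 3) else 0
    return [1 if on_campus else 0, daytime, lab_day, hits, notes_downloaded]


# Hypothetical visits: a Tuesday afternoon on campus, a Saturday night off campus.
worker_like = visit_features(True, datetime(2001, 10, 2, 14, 30), hits=12, notes_downloaded=3)
night_owl = visit_features(False, datetime(2001, 10, 6, 22, 0), hits=2, notes_downloaded=0)
```

The first three attributes are binary and the last two are counts, so the counts may need rescaling before a distance-based method such as k-means is applied.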
First ten lines of the doc. Visists. txt
Frequency of binary attributes of visitors with no class-notes downloads
Frequency of binary attributes of visitors with class-notes downloads
Clustering of Visitors to Course Site
java Kmeans v8076toCluster5D.txt 8076 5 3 100
v8076toCluster5D.txt is the input file.
8,076 is the number of records.
5 is the number of columns per record.
3 is the number of clusters the records should be grouped into.
100 is the number of iterations.
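The Java Kmeans program itself is not shown in the slides. As a rough illustration of what such a run does, here is a plain k-means sketch in Python that takes the same kind of inputs (records, number of clusters, number of iterations). The function name, the random initialization, and the toy records are assumptions, not the course's implementation.

```python
import random


def kmeans(records, k, iterations, seed=0):
    """Plain k-means: returns (centroids, cluster index for each record)."""
    rng = random.Random(seed)
    # Assumption: start from k records picked at random as initial centroids.
    centroids = [list(r) for r in rng.sample(records, k)]
    assignment = [0] * len(records)
    for _ in range(iterations):
        # Assignment step: attach each record to its nearest centroid
        # (squared Euclidean distance).
        for i, rec in enumerate(records):
            assignment[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(rec, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [records[i] for i in range(len(records)) if assignment[i] == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment


# Hypothetical toy data with two obvious groups.
toy = [[0, 0], [0, 1], [10, 10], [10, 11]]
centroids, assignment = kmeans(toy, k=2, iterations=100)
```

For the course data the call would correspond to 8,076 five-column records, `k=3`, and `iterations=100`.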
Attributes of three clusters obtained with five variables
Attributes of three clusters obtained with two variables
Web Personalization: ClickWorld (Baglioni 2003)
Classification of Web Users
A web-personalization project called ClickWorld used classification as the main underlying technique.
It aimed at extracting models of the navigational behavior of users.
The analysis covered five months of access logs from an Italian web portal.
The portal includes web content of national interest: news, forums, and a humor section.
In addition, it includes more than 30 websites of local interest.
Predicting Interests of Users by Means of a Classification Model
Classification problem: predicting whether a user session includes access to a channel, based on the accesses to the other channels.
The model can be used to predict the potential interest of a new user in a channel he or she has not visited yet.
It can also be used to suggest another channel of interest to new users.
The model can also predict which users may be interested in a new channel, so that the new channel can be prominently displayed to these possibly interested users.
The data contain more than 215,000 user sessions.
Each session is represented as follows:
The chosen channel is the output or class attribute.
The other channels chosen in the session are input variables.
Each attribute has a binary value.
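The session representation described above can be sketched as follows. The helper name `session_to_example` and the channel list are hypothetical (the slides name news, forums, and humor; "sport" is added here only for illustration).

```python
def session_to_example(session_channels, all_channels, target):
    """Encode one session as (binary inputs, binary class label).

    inputs: one 0/1 flag per channel other than the target channel.
    label:  1 if the session accessed the target channel, else 0.
    """
    inputs = [1 if ch in session_channels else 0 for ch in all_channels if ch != target]
    label = 1 if target in session_channels else 0
    return inputs, label


# Hypothetical channel list and session.
channels = ["news", "forum", "humor", "sport"]
inputs, label = session_to_example({"news", "humor"}, channels, target="humor")
# inputs covers news, forum, sport in that order
```

One such data set would be built per target channel, so that a separate model predicts interest in each channel.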
Classification Model
Baglioni used the C4.5 classification algorithm to predict whether a user would be interested in visiting a section of the website, based on the sections the user already visited.
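C4.5 builds a decision tree by choosing, at each node, the attribute with the best gain ratio. The sketch below computes that criterion for one binary attribute. It illustrates the split measure only, not the ClickWorld implementation or a full C4.5 (no pruning, no tree construction), and the toy vectors are made up.

```python
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    n = len(values)
    result = 0.0
    for v in set(values):
        p = values.count(v) / n
        result -= p * log2(p)
    return result


def gain_ratio(attribute, labels):
    """Information gain of splitting on `attribute`, divided by its split information."""
    n = len(labels)
    parts = {}
    for a, y in zip(attribute, labels):
        parts.setdefault(a, []).append(y)
    remainder = sum(len(part) / n * entropy(part) for part in parts.values())
    gain = entropy(labels) - remainder
    split_info = entropy(attribute)
    return gain / split_info if split_info > 0 else 0.0


# Hypothetical toy data: did the session visit the target channel?
visited_target = [1, 1, 0, 0]
visited_news = [1, 1, 0, 0]   # perfectly predictive input channel
visited_forum = [1, 0, 1, 0]  # uninformative input channel
informative = gain_ratio(visited_news, visited_target)
uninformative = gain_ratio(visited_forum, visited_target)
```

An attribute that separates the classes perfectly scores 1.0 here, while one that is independent of the class scores 0.0.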
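Precision and Recall
Precision measures what proportion of predictions is correct: $P = \frac{|Ret \cap Rel|}{|Ret|}$
Recall measures what percentage of the channels the user is interested in is identified as such by the classification model: $R = \frac{|Ret \cap Rel|}{|Rel|}$
Here Ret is the set of channels the model predicts (retrieves) and Rel is the set of channels the user is actually interested in (relevant).
Low recall means the model is not able to identify all channels the user is interested in.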
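As a worked example of the two measures, the sketch below computes them from a predicted set (Ret) and a relevant set (Rel). The function name and the channel sets are hypothetical; returning 0.0 for an empty set is a convention chosen here, not something stated in the slides.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a predicted set and a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # |Ret ∩ Rel|
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: the model predicts three channels, the user likes four.
p, r = precision_recall({"news", "forum", "humor"}, {"news", "humor", "sport", "local"})
```

Two of the three predictions are correct, so precision is 2/3; two of the four channels of interest are found, so recall is 0.5.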
Association Mining of Web Usage (Batista and Silva 2002)
Publico On-Line is a daily online newspaper in Portugal.
The association mining addressed essentially the same question as the one answered by the earlier classification exercise: which categories of articles are requested by the same visitor?
Market-basket analysis: find groups of items that are frequently referred to together.
Here, a transaction is the web request and the item is the news section that the article is from.
The association mining results showed strong associations between the following pairs:
Politics and Society
Politics and International News
Politics and Sports
Society and International News
Society and Local Lisbon
Society and Sports
Society and Culture
Sports and International News
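Pairs like those listed above come out of counting how often two sections occur together. A minimal market-basket sketch follows, assuming each transaction is the set of news sections requested by one visitor. The function name and the four toy transactions are made up for illustration and are not the newspaper's data.

```python
from collections import Counter
from itertools import combinations


def pair_support(transactions):
    """Return {(item_a, item_b): fraction of transactions containing both}."""
    counts = Counter()
    for items in transactions:
        # Sort so that each unordered pair always has the same key.
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    n = len(transactions)
    return {pair: c / n for pair, c in counts.items()}


# Hypothetical transactions (sets of sections requested together).
visits = [
    {"Politics", "Society"},
    {"Politics", "Society", "Sports"},
    {"Sports"},
    {"Politics", "International News"},
]
support = pair_support(visits)
```

Pairs whose support exceeds a chosen threshold would be reported as strong associations; the threshold itself is a tuning choice.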
Sequence Pattern Analysis
Sequence-Pattern Analysis of Web Logs
The data in web-access logs are intrinsically sequential.
We used data-visualization techniques for aggregate navigation (using Pathalizer) and for navigation in individual sessions (using StatViz).
Analyzing sequences of web requests is an important area of research.
The example data set is the msnbc.com anonymous web data.
First twenty lines of msnbc.com data (Note: The third line wraps around twice)
http://kddics.uci.edu/databases/msnbc.html
First and last ten lines of freq2.txt
A sample of lines from sortedfreq2.txt
A sample of lines from sortedfreq3.txt
Example of Analysis
The pairs that stay within the same category have higher frequencies: once a user starts reading an article from one category, he or she is likely to access another article from the same category.
The pairs wherein one of the categories is 1 (corresponding to the front page) tend to have higher frequencies, which reflects users first coming to the front page and then following links to other categories.
Pair (6, 7) is the first pair with two distinct categories, neither of which is the front page.
Category 6 is “on-air” and Category 7 is “misc.”
The high frequency for this pair suggests that the site should provide easy navigation from “on-air” to “misc.”
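Pair frequencies of this kind can be obtained by counting consecutive category pairs in each session. The sketch below assumes each session is a list of category numbers in request order. Whether the frequency files mentioned in these slides were produced exactly this way is an assumption, and the toy sessions are invented.

```python
from collections import Counter


def consecutive_pair_counts(sessions):
    """Count how often category `a` is immediately followed by category `b`."""
    counts = Counter()
    for seq in sessions:
        for first, second in zip(seq, seq[1:]):
            counts[(first, second)] += 1
    return counts


# Hypothetical sessions; 1 = front page, 6 = on-air, 7 = misc.
sessions = [[1, 1, 2], [6, 7, 7], [1, 2, 2, 2], [6, 7]]
counts = consecutive_pair_counts(sessions)
ranked = counts.most_common()  # pairs sorted by descending frequency
```

Sorting the counts in descending order gives a ranked list in which same-category pairs and front-page pairs can be compared against cross-category pairs such as (6, 7).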
Summary
In this chapter we studied access logs to determine what users want.
We studied analysis and visualization tools.
We studied mining applications on web access logs.