Clickstream analysis: data collection, preprocessing and mining using the LISp-Miner system
A case study approach: effective placement of on-line advertisements
Tomáš Kliegr, 7. 3. 2007

Methodology: CRISP-DM

I. Data collection
• Data are collected on the server application layer
• No demands on the tracked website

Comparison with log-file based approaches
• Works with all browsers that have cookies enabled
• Automatic robot filtering
• Storage efficiency
• Easy to integrate and safe to operate

II. Data preprocessing
Problem: collected clickstreams have varying lengths.
Goal: create a higher-level abstraction of the visitor.
This phase creates a fixed-length visitor profile in a two-step process:
• Segment procedure: classifies pages into a domain-specific taxonomy on several levels of granularity.
• Merge procedure: extracts important and characteristic information from the visitor's clickstream.

Assigning pages to categories
• Input: visited pages (URL addresses stored in a database)
• Input: prespecified taxonomy (tuples Product ID - category, tuples URL pattern - category)
• Processing: SQL Server stored procedure (SP) Segment
• Output: pages classified on several levels of granularity
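The URL pattern - category lookup can be sketched in a few lines. The slides implement this step as a SQL Server stored procedure; the Python version below is only an illustration, and the taxonomy rows, the regular-expression matching and the function name `segment` are assumptions made for the example.

```python
import re

# Hypothetical taxonomy rows: (URL pattern, Cat, Topic, ECat).
# The real taxonomy lives in a database and is not shown in the slides.
TAXONOMY = [
    (r"/hiking-alps/", "Search", "Alps", "Catalogue"),
    (r"/insurance/", "General information", None, "Insurance"),
]


def segment(url):
    """Return (Cat, Topic, ECat) for the first pattern found in the URL, else None."""
    for pattern, cat, topic, ecat in TAXONOMY:
        if re.search(pattern, url):
            return cat, topic, ecat
    return None


print(segment("www.poznani.cz/hiking-alps/"))
```

The first matching pattern wins, so more specific patterns would have to be listed before more general ones.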

Segment procedure
• Classifies pages into a domain-specific taxonomy on several levels of granularity.
• Assigns Time on page and Score to each page in the visitor's clickstream.
• Score expresses the absolute weight of a particular page in the user's clickstream:
  S = (ln(o) + 1) * t
  o: order of the page in the user's clickstream
  t: time on page
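A minimal sketch of the Score formula follows. It assumes that the order o starts at 1 (so ln(o) is defined and the first page gets S = t) and that time on page is measured in seconds; neither detail is stated in the slides, and the sample clickstream is made up.

```python
import math


def page_score(order, time_on_page):
    """S = (ln(o) + 1) * t, with o the 1-based order of the page and t the time on page."""
    if order < 1:
        raise ValueError("order must be >= 1 (assumed 1-based)")
    return (math.log(order) + 1.0) * time_on_page


# Hypothetical clickstream: (page, time on page in seconds).
clickstream = [("home", 12.0), ("search", 30.0), ("hiking-alps", 45.0)]
scores = [(page, page_score(i, t)) for i, (page, t) in enumerate(clickstream, start=1)]
for page, s in scores:
    print(page, round(s, 2))
```

Because ln(o) grows with the order, a page visited later in the session receives a higher score than an earlier page with the same time on page.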

Segment: example output
Page: www.poznani.cz/hiking-alps/
General category (Cat): Search
Topic: Alps
Extended category (ECat): Catalogue

Merge procedure
This procedure creates the visitor profile:
• Basic attributes (6): Total time on web, Number of displayed pages, Day of week, Hour of day, Referring domain (consisting of the URL and Cat attributes).
• Important points on the path (12): Entry page, Exit page, Conversion page (each described by Page name, Cat, ECat and S).
• Attributes conceptualizing the path (11): Range of interest, Most favourite topic (Topic, S), Search total (S) and by type (Fulltext (S), Extended search (S), Catalogue search (S)), General information pages total (S) and by type (Discounts (S), Insurance (S), About (S)).
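The idea of collapsing a variable-length clickstream into a fixed-length profile can be sketched as below. This covers only a handful of the attributes listed above; the record layout, the field names and the "highest summed score" rule for the most favourite topic are assumptions made for the example, not the heuristics used in the case study.

```python
from collections import defaultdict


def merge(pages):
    """Build a fixed-length profile from a non-empty list of segmented pages.

    Each page is a dict with keys 'page', 'cat', 'topic', 'time', 'score'
    (a hypothetical layout chosen for this sketch).
    """
    topic_scores = defaultdict(float)
    for p in pages:
        if p["topic"] is not None:
            topic_scores[p["topic"]] += p["score"]
    # Assumed heuristic: the favourite topic is the one with the highest summed score.
    favourite = max(topic_scores, key=topic_scores.get) if topic_scores else None
    return {
        "total_time": sum(p["time"] for p in pages),
        "page_count": len(pages),
        "entry_page": pages[0]["page"],
        "exit_page": pages[-1]["page"],
        "favourite_topic": favourite,
        "search_total_score": sum(p["score"] for p in pages if p["cat"] == "Search"),
    }


visit = [
    {"page": "home", "cat": "General information", "topic": None, "time": 10.0, "score": 10.0},
    {"page": "hiking-alps", "cat": "Search", "topic": "Alps", "time": 20.0, "score": 30.0},
    {"page": "hiking-tatras", "cat": "Search", "topic": "Tatras", "time": 5.0, "score": 10.0},
]
profile = merge(visit)
print(profile)
```

Whatever the length of the visit, the resulting dictionary always has the same keys, which is what makes the profiles usable as rows of a data matrix for mining.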

Merge: example output

III. Data mining
• Association rules are the most frequently used approach [Facci, Lanza]
• LISp-Miner system: 4ft-Miner, SD4ft-Miner
• Sample task: From which referring class of websites do most converted visitors come?

Choosing the right quantifier
• LISp-Miner offers a range of quantifiers
• Founded implication (a, b, c, d are the frequencies of the four-fold contingency table of Ant and Suc)
  – Support: a, or a/(a+b+c+d)
  – Confidence: a/(a+b)
  – Problem: tight dependencies are rarely found and rarely required in clickstream data
• Above average quantifier: "Among objects satisfying Ant there are at least 100*p per cent more objects satisfying Suc than there are objects satisfying Suc in the whole data matrix." (LISp-Miner Help)

Illustration

Ant / Suc       Conversion   Not(Conversion)
Partner webs    7            63
Not(PW)         7            693

• Share of objects satisfying Suc among objects satisfying Ant = 7/70 = 0.1
• Share of objects satisfying Suc in the entire data matrix = 14/770 = 0.018
• Confidence threshold: at most 7/(63+7) = 0.1
• AAI threshold: at most 0.1/0.018 = 5.555 (5.5 when 14/770 is not rounded)
LISp-Miner demonstration
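The quantifier arithmetic for this table can be checked with a short sketch. The function names are illustrative and are not part of LISp-Miner; the above-average value is computed here as the ratio of the confidence to the overall share of Suc, and reading p as "ratio minus one" follows the Help quotation and is an interpretation.

```python
def support(a, b, c, d):
    """Relative support: a / (a + b + c + d)."""
    return a / (a + b + c + d)


def confidence(a, b):
    """Confidence of Ant => Suc: a / (a + b)."""
    return a / (a + b)


def above_average_ratio(a, b, c, d):
    """Confidence divided by the share of Suc in the whole data matrix."""
    base_rate = (a + c) / (a + b + c + d)
    return confidence(a, b) / base_rate


# Four-fold table from the illustration: Ant = Partner webs, Suc = Conversion.
a, b, c, d = 7, 63, 7, 693
conf = confidence(a, b)
ratio = above_average_ratio(a, b, c, d)
p = ratio - 1.0  # "100*p per cent more" in the wording of the Help quotation

print(f"support={support(a, b, c, d):.4f} confidence={conf:.3f} ratio={ratio:.3f} p={p:.3f}")
```

With unrounded intermediate values the ratio is exactly 5.5; the 5.555 on the slide comes from rounding 14/770 to 0.018 before dividing.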

SD4ft-Miner
• Mines for patterns of the form φ ≈ ψ / (α, β, γ)
• This SD4ft-pattern means that the subsets given by Boolean attributes α, β differ in what concerns the relation of Boolean attributes φ, ψ when condition γ is satisfied.
• Which groups of customers α, β (i.e. depending on where they come from) differ remarkably, and under what condition γ, in the probability of conversion?
• We express "the conversion condition" by setting only the succedent (ψ); we leave the antecedent unset.

4ft-Miner vs SD4ft-Miner
• 4ft-Miner: above average quantifier
• SD4ft-Miner: negative gace type for the 2nd subset
• The increase in the conversion rate is more suitable for our purposes, as the 2nd set is disjoint from the 1st set.
• The conversion rate for partner webs is 78% higher than the average for other referrers: Conf1/Conf2 = 0.132/0.074 = 1.784

Solution to Task 1
From which referring class of websites do most converted visitors come?

SD4ft-Miner (cont.)
• The output is sorted by the difference of confidence values.
• The first rule says: the conversion rate for visitors coming from partner websites is 13.2%, while the conversion rate for visitors coming from the company's own websites is only 4.9%.
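The two ways of comparing the subsets, by ratio and by difference of confidences, can be reproduced from the reported figures. This is only an arithmetic sketch; the function names are illustrative and do not correspond to LISp-Miner options.

```python
def confidence_ratio(conf1, conf2):
    """How many times higher the conversion rate of the first subset is."""
    return conf1 / conf2


def confidence_difference(conf1, conf2):
    """Absolute difference of the two conversion rates."""
    return conf1 - conf2


# Partner webs vs other referrers (figures from the 4ft-Miner vs SD4ft-Miner slide).
ratio = confidence_ratio(0.132, 0.074)
# Partner websites vs the company's own websites (figures from the first rule).
difference = confidence_difference(0.132, 0.049)

print(f"ratio={ratio:.3f} difference={difference:.3f}")
```

A ratio of about 1.78 corresponds to the "78% higher" statement, while the difference of 0.083 means 8.3 percentage points between partner websites and the company's own websites.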

Review
The goal of the second run of the CRISP-DM cycle is to:
• Extend the available information: log user actions
• Improve the heuristics for the Most favourite topic
• Involve page texts
• New development platform: Ferda boxes

References
• Rauch, J., Šimůnek, M.: An Alternative Approach to Mining Association Rules. In: Foundations of Data Mining and Knowledge Discovery. Berlin, 2005
• Rauch, J., et al.: Mining for Patterns Based on Contingency Tables by KL-Miner - First Experience. In: Foundations and Novel Approaches in Data Mining. Berlin: Springer, 2005
• Strossa, P., et al.: Reporting Data Mining Results in a Natural Language. In: ibid.
• Kováč, M., et al.: Ferda, New Visual Environment for Data Mining. Znalosti 2006
• LM Report Asistent. Znalosti 2007
• lispminer.vse.cz, ferda.sourceforge.net/