Processing and Analyzing Large log from Search Engine
- Slides: 24
Processing and Analyzing Large log from Search Engine Meng Dou 13/9/2012
Big data
Web-browsing data, social network communications, sensor data -> behavior data
Google and Facebook, for example, are Big Data companies.
Challenges
• Big data processing
• Extracting useful information that reflects user behavior from massive logs
• Instance data management
• Data analysis
Opportunities
Behavior data (such as web logs) can be used to improve and support business processes through data mining, process mining, and so on.
[Architecture diagram]
Analytic applications: Data Mining, Machine Learning, Process Mining, BI/Reporting
Big Data processing: Cloud computing (Map/Reduce framework)
Big Data access: Hive, Cassandra
Cloud storage: NoSQL key-value databases (HBase, Cassandra, MongoDB), distributed file system (HDFS)
Data: raw data, unstructured data, instance data
Case study: Search Engine Company
• News, Page, Image, Maps, Music, Navigation
Dataset: 66 million clicks in one month, 2.2 million clicks per day -> generate behavior in 10 minutes
User behavior:
• Visiting path (Referer)
• Search result effectiveness
• Ad clicking behavior
• Source and destination of user visits
• Robot behavior recognition and analysis
• Visited page layout
• Behavior comparison and product improvement
• User grouping and recommendation
Data features
• Contains massive information in a well-recorded format
• Large scale with high growth potential
• Real-time analysis
Existing tools
Data extraction: XESame, ProM Import
Cloud storage / non-relational DB -> extracting data from the cloud -> instance data (XES)
Process mining: ProM
1) Because the data set is large, analysis is slow and in most cases the tool crashes.
2) Offline analysis -> real-time analysis
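Instance data in XES is an XML document made of traces and events. The following is a minimal sketch of writing such a document with Python's standard library, not the output of XESame or ProM Import. The function name `to_xes`, the case ID, and the activity values are made up for illustration; only the standard `concept:name` and `time:timestamp` attribute keys are used.

```python
import xml.etree.ElementTree as ET

def to_xes(cases):
    """Serialize {case_id: [(activity, iso_timestamp), ...]} to an XES string."""
    log = ET.Element("log", {"xes.version": "1.0"})
    for case_id, events in cases.items():
        trace = ET.SubElement(log, "trace")
        # The trace-level concept:name holds the case identifier.
        ET.SubElement(trace, "string", {"key": "concept:name", "value": case_id})
        for activity, timestamp in events:
            event = ET.SubElement(trace, "event")
            ET.SubElement(event, "string", {"key": "concept:name", "value": activity})
            ET.SubElement(event, "date", {"key": "time:timestamp", "value": timestamp})
    return ET.tostring(log, encoding="unicode")

# Hypothetical case: one visit with two clicks.
xes = to_xes({"visit-1": [("web", "2012-09-01T10:00:01+00:00"),
                          ("news", "2012-09-01T10:00:05+00:00")]})
print(xes)
```

A complete log would normally also declare the XES extensions it relies on; that part is omitted here to keep the sketch short.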
System Structure
[Diagram] Log processing -> Extracting useful information that reflects user behavior from massive logs -> Understandable model
Convert raw log to instance data (event log) with Map/Reduce
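The conversion can be pictured as a map phase that keys each click by visitor and a reduce phase that orders each visitor's clicks into one trace. The following is a minimal single-process sketch in Python, not the job used in the talk. The tab-separated field layout (timestamp, IP, user agent, URL) and all function names are assumptions for illustration.

```python
from collections import defaultdict
from datetime import datetime

def map_click(line):
    """Map phase: emit (case key, (timestamp, url)) for one raw log line.

    Assumed (hypothetical) layout: timestamp <TAB> ip <TAB> user_agent <TAB> url
    """
    ts, ip, ua, url = line.rstrip("\n").split("\t")
    return (ip, ua), (datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"), url)

def reduce_case(key, values):
    """Reduce phase: sort one visitor's clicks by time into a trace."""
    return key, [url for _, url in sorted(values)]

def raw_log_to_event_log(lines):
    # Shuffle step: group mapped pairs by key, as the framework would do.
    groups = defaultdict(list)
    for line in lines:
        key, value = map_click(line)
        groups[key].append(value)
    return dict(reduce_case(k, v) for k, v in groups.items())

# Made-up sample lines, deliberately out of time order.
raw = [
    "2012-09-01 10:00:05\t1.2.3.4\tFirefox\t/news",
    "2012-09-01 10:00:01\t1.2.3.4\tFirefox\t/",
    "2012-09-01 10:00:03\t5.6.7.8\tChrome\t/image",
]
event_log = raw_log_to_event_log(raw)
```

Because each key is reduced independently, the reduce step can run on many nodes in parallel, which is what makes the approach suitable for a cluster.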
Test environment: CPU Intel Xeon 2.40 GHz, RAM 2 GB, 14 nodes

fileSize | logNum | OnePCTime | MapReduceTime | MapNum | ReduceNum
8.84 MB | 36422 | 5 s 921 ms | 7 s | 3 | 15
65.8 M | 218177 | 30 s 846 ms | 25 s | 3 | 15
112 M | 772241 | 48 s 559 ms | 30 s | 3 | 15
One day (371 M) | 2,200,000 | 2.5 minutes | 1.3 minutes | 40 | 15
One week | 15,000,000 | 20 minutes (expected) | 2.5 minutes | 280 | 15
One month | 66,000,000 | 2 hours (expected) | 6 minutes | 1200 | 15
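The speed-up implied by the measured rows can be computed directly. The sketch below only restates the measured single-PC and Map/Reduce timings above in seconds and divides them; the rows with expected (not measured) single-PC times are left out, and the variable names are illustrative.

```python
# Measured timings converted to seconds: (single PC, Map/Reduce).
timings = {
    "8.84 MB": (5.921, 7.0),
    "65.8 M": (30.846, 25.0),
    "112 M": (48.559, 30.0),
    "One day (371 M)": (2.5 * 60, 1.3 * 60),
}

# A speed-up above 1 means the Map/Reduce job was faster than one PC.
speedup = {size: single / mr for size, (single, mr) in timings.items()}

for size, s in speedup.items():
    print(f"{size}: {s:.2f}x")
```

For the smallest file the job overhead outweighs the parallelism, while the one-day log runs roughly twice as fast on the cluster.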
Process Discovery
One instance (case) is defined as a single visit by one visitor, identified by:
• IP + UA
• CookieID
The definition of an activity varies with the analysis requirements.
Mining approaches: Alpha miner, Heuristic miner, Fuzzy miner, Sequence model
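Discovery algorithms such as the Alpha miner and the Heuristic miner build on the directly-follows relation: how often one activity is immediately followed by another within a case. The following is a minimal sketch of counting that relation in Python. The traces are hypothetical, and the sketch covers only this first counting step, not a full miner.

```python
from collections import Counter

def directly_follows(traces):
    """Count how often activity a is immediately followed by b in any trace."""
    counts = Counter()
    for trace in traces:
        for a, b in zip(trace, trace[1:]):
            counts[(a, b)] += 1
    return counts

# Hypothetical traces: one list of activities per case (one visit).
traces = [
    ["web", "news", "image"],
    ["web", "news"],
    ["web", "map"],
]
df = directly_follows(traces)
print(df.most_common())
```

Changing what counts as an activity (for example a channel versus a page block) changes the traces and therefore the discovered model.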
Behavior analysis

User behavior pattern | Range | Activity | Data selection
Interaction between channels | all | ContentType | Web Map
Visiting path | all | Referer/URL |
Webpage layout | news | ContentType+PageType+Block | (Channel=news) AND (PageType=195)
Webpage layout | image | ContentType+PageType+Block | (Channel=image) AND (PageType=435)
Searching result | | |
Behavior grouping | | |
Registration | all | |
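The news selection above, (Channel=news) AND (PageType=195) with activity ContentType+PageType+Block, can be read as a filter plus an activity-naming rule. The following is a minimal sketch in Python, assuming each click is a dictionary keyed by those field names. The sample values, including the Block names, are made up.

```python
def select_news(event):
    """Data selection: (Channel = news) AND (PageType = 195)."""
    return event["Channel"] == "news" and event["PageType"] == 195

def activity_name(event):
    """Activity = ContentType + PageType + Block, joined with '+'."""
    return f'{event["ContentType"]}+{event["PageType"]}+{event["Block"]}'

# Hypothetical click events; the values are made up for illustration.
events = [
    {"Channel": "news", "PageType": 195, "ContentType": "news", "Block": "top"},
    {"Channel": "image", "PageType": 435, "ContentType": "image", "Block": "grid"},
    {"Channel": "news", "PageType": 7, "ContentType": "news", "Block": "side"},
]

activities = [activity_name(e) for e in events if select_news(e)]
print(activities)
```

Only the first event passes the filter, so the resulting event log contains a single activity for the news page layout analysis.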
Active visitor's visiting path
Main page
Sequence model
XES statistics
Conclusion
This project is a good entry point into the data analysis field, combining web data analysis, process mining, and cloud computing technology.
Future work:
1 More algorithms and technologies should be applied to this data set.
2 Behavior comparison and user recommendation still need to be accomplished.
3 Can process mining analyze behavior that does not follow a fixed pattern?

1 Log sampling
2 Detect errors in the logs before applying analysis techniques to them.
3 Extend ProM or XESame with a function for converting data from a key-value database or cloud storage to an event log.
Feedback
1 What are the real questions?
2 Why process mining?
Thank you! Meng Dou 13/9/2012