Big Data Project based on Movie and Actor
Group 7 Members: 20293055 LI Derui, 20292972 LIU Dan, 20301735 ZHANG Haowei. 2016
Outline
• Sections: Introduction, Data Collection, Entity Resolution, Data Fusion, Demo
• Topics: Data Usage, Data Source, Data Crawling, F-Swoosh, Strategy, Evaluation, Webpage
Introduction Data Usage Specific Query Oscar
Introduction Data Usage Statistics
Data Collection: Data Source
• IMDb: http://www.imdb.com/year/
• DB: https://themoviedb.org
• INSIDER: http://www.movieinsider.com/movies/-
Data Collection: Data Attributes
• Movie (15 attributes): title, directors, stars, type, budget, url, awards, duration, alternative_titles, imdb_rating, official_page, description, release_time, part_of, revenue
• Actor (10 attributes): name, awards, gender, nationality, title, known_for, alternative_names, birth, url, official_page
Data Collection: Details
• Crawl data in BFS order
• Avoid getting banned:
  • Rotate the user agent
  • Disable cookies
  • Change download delays
• Extract:
  • XPath
  • Regular expressions
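The BFS crawl order can be sketched as follows. This is a minimal illustration, not the project's crawler: `bfs_crawl`, `get_links`, and the in-memory `LINKS` graph are hypothetical stand-ins for real page fetching and link extraction, and no network access is performed.

```python
from collections import deque


def bfs_crawl(start, get_links, limit=100):
    """Visit pages breadth-first, returning them in visiting order."""
    seen = {start}           # pages already queued, to avoid revisiting
    queue = deque([start])   # FIFO queue gives the BFS order
    order = []
    while queue and len(order) < limit:
        url = queue.popleft()
        order.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


# Hypothetical link graph standing in for real pages.
LINKS = {
    "/year/2016": ["/movie/a", "/movie/b"],
    "/movie/a": ["/actor/x", "/actor/y"],
    "/movie/b": ["/actor/y", "/actor/z"],
}

order = bfs_crawl("/year/2016", lambda url: LINKS.get(url, []))
```

Year pages are visited first, then all movies, then all actors, and an actor linked from two movies is queued only once.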
Data Collection: Data Set
• Data size:

Source    Movies   Actors
IMDb      36395    45986
DB        24892    21018
Insider   1529     1829

• Data preprocessing (resolving different representations of the same value):
  • Actor: name, nationality, birth
  • Movie: revenue, duration, awards
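As an illustration of resolving different representations, the sketch below normalizes two movie fields. The input formats ("$1,234,567", "2h 15min", "135 min") and the function names are assumptions for the example; the slides do not say which formats each source actually used.

```python
import re


def parse_revenue(text):
    """Turn a string such as '$1,234,567' into an integer amount.

    Assumes whole amounts without a decimal part (hypothetical format).
    """
    digits = re.sub(r"\D", "", text)
    return int(digits) if digits else None


def parse_duration(text):
    """Turn '2h 15min', '2h' or '135 min' into a number of minutes."""
    pattern = r"\s*(?:(\d+)\s*h)?\s*(?:(\d+)\s*(?:min|m)?)?\s*"
    m = re.fullmatch(pattern, text.lower())
    if not m or not (m.group(1) or m.group(2)):
        return None  # unrecognized representation
    hours = int(m.group(1) or 0)
    minutes = int(m.group(2) or 0)
    return hours * 60 + minutes
```

After this step, "2h 15min" from one source and "135 min" from another compare as equal values.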
Entity Resolution: Aim
Remove duplicates and merge records from different sources that represent the same real-world entity.
Entity Resolution: Aim (example)

Record 1   Martha LIU              1994-06-04               Actor
Record 2   Martha L.               1994-06-04               Actor, Director
Merged     Martha LIU; Martha L.   1994-06-04; 1994-06-04   Actor; Actor, Director
Entity Resolution: Technical
• F-Swoosh: Match + Merge
• Two features:
  • F1: {name, birth}
  • F2: {birth, known_for, nationality}
• Match: Jaro-Winkler distance
• Merge when Similarity(R1.F1, R2.F1) >= 0.80 or Similarity(R1.F2, R2.F2) >= 0.80
Entity Resolution: String Matching
Jaro-Winkler distance
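For reference, a minimal pure-Python sketch of Jaro-Winkler similarity. It uses the standard Winkler parameters (prefix scale 0.1, prefix capped at 4 characters); the function names and the `THRESHOLD` constant are illustrative, and the slides do not say which implementation the project used or how the attributes of a feature were combined into one string.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, p=0.1):
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


THRESHOLD = 0.80  # match threshold stated on the slides
is_match = jaro_winkler("Martha LIU", "Martha L.") >= THRESHOLD
```

With these definitions the two name variants "Martha LIU" and "Martha L." score above the 0.80 threshold and would be treated as a match.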
Entity Resolution: Merge
• Before: R3, R4, R1, R2
• R5 = merge<R1, R3>
• Remove the merged inputs R1 and R3
• After: R5, R4, R2
Entity Resolution: Reducing Redundancy
• Two hash tables, Hf1 and Hf2, record all previously seen feature values
• Example:
  • Hf1: [(R1.f1, R1), (R2.f1, R2)]
  • Current record = R3
  • If Hf1(R3.f1) = R1, then merge <R3, R1> directly; no match comparison is needed
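The lookup step can be sketched as below. This is a simplified illustration, not the full F-Swoosh algorithm: it only handles feature values seen before (exact lookup), omits the fallback similarity comparison, and does not join two clusters that a record hits through different features. All names and the sample records are hypothetical.

```python
F1 = ("name", "birth")
F2 = ("birth", "known_for", "nationality")


def feature_value(record, attrs):
    """Return the feature as a hashable tuple, or None if a part is missing."""
    value = tuple(record.get(a) for a in attrs)
    return None if None in value else value


def resolve(records):
    tables = [(F1, {}), (F2, {})]   # Hf1 and Hf2: feature value -> cluster index
    clusters = []                   # each cluster holds the records merged so far
    for rec in records:
        hit = None
        for attrs, table in tables:
            value = feature_value(rec, attrs)
            if value is not None and value in table:
                hit = table[value]  # seen before: merge without comparing
                break
        if hit is None:
            clusters.append([])
            hit = len(clusters) - 1
        clusters[hit].append(rec)
        for attrs, table in tables:
            value = feature_value(rec, attrs)
            if value is not None:
                table.setdefault(value, hit)
    return clusters


# Hypothetical sample records.
r1 = {"name": "Martha LIU", "birth": "1994-06-04", "known_for": "A", "nationality": "USA"}
r2 = {"name": "John DOE", "birth": "1980-01-01", "known_for": "B", "nationality": "UK"}
r3 = {"name": "Martha LIU", "birth": "1994-06-04", "known_for": None, "nationality": "USA"}
clusters = resolve([r1, r2, r3])
```

Here R3 is merged with R1 because its F1 value is already in Hf1, so no pairwise string comparison is run for it.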
Data Fusion: Aim
Resolve conflicts between values from different sources and find the values that reflect the real world.
Data Fusion: Conflict Handling Strategies
Conflict Ignorance, Conflict Resolution, Union, Authority
Data Fusion: Authority
• Actor: name, gender, birth, nationality
• Movie: title, directors, release_time, duration, budget, revenue, descriptions
Data Fusion: Domain Authority
• Weights:
  • IMDb: 0.96
  • DB: 0.64
  • Insider: 0.5
  • Unknown: 0.01
• Example:

Source         Example 1   Example 2
IMDB           1993        Unknown
DB             1992        USA
INSIDER        1992        Unknown
Final Result   1992        USA
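Both examples are consistent with a weighted vote, sketched below. The slides do not state the aggregation rule, so this assumes that weights are summed per candidate value and that a value of "Unknown" contributes 0.01 regardless of its source. The names `SOURCE_WEIGHT`, `UNKNOWN_WEIGHT`, and `fuse` are illustrative.

```python
from collections import defaultdict

SOURCE_WEIGHT = {"imdb": 0.96, "db": 0.64, "insider": 0.5}
UNKNOWN_WEIGHT = 0.01  # assumed to apply to any "Unknown" value


def fuse(values):
    """Pick the value with the highest summed weight.

    `values` maps a source name to the value that source reports.
    """
    scores = defaultdict(float)
    for source, value in values.items():
        weight = UNKNOWN_WEIGHT if value == "Unknown" else SOURCE_WEIGHT[source]
        scores[value] += weight
    return max(scores, key=scores.get)


example_1 = fuse({"imdb": "1993", "db": "1992", "insider": "1992"})
example_2 = fuse({"imdb": "Unknown", "db": "USA", "insider": "Unknown"})
```

Under this assumption, 1992 wins Example 1 because 0.64 + 0.5 = 1.14 exceeds 0.96, and USA wins Example 2 because 0.64 exceeds 0.01 + 0.01 = 0.02.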
Evaluation: Comparison
• Two entity resolution methods are compared:
  • R-Swoosh
  • F-Swoosh
Evaluation Example R-swoosh
Evaluation Example F-swoosh:
Evaluation: Final Result

               Actor   Movie
Original       11829   11529
Final Result   8213    9655
Demo Final Webpage Demo
Q&A
Thank You GROUP 7