Machine Learning & Data @ Zillow Group
Jasjeet Thind, Vice President, Data Science & Engineering
@JasjeetThind

Agenda
• Zillow Group
• Machine Learning Use Cases
• Architectural Patterns
• Models
• Machine Learning Pipeline
• Data Quality
• Free Zillow Data

Zillow Group
Build the world's largest, most trusted and vibrant home-related marketplace.

Machine Learning Use Cases
• Personalization
• Social & Content
• Ad Targeting
• Demographics & Community
• Zestimate (AVMs)
• Business Analytics
• Premier Agent (B2B)
• Forecasting home price trends
• Mortgages
• Deep Learning

Architecture
• Real-Time Scoring APIs (Python, Flask)
• Data Collection Systems (Java/Python/SQL)
• Ranking (Spark)
• Featurization (Spark)
• Aggregate Features (Spark)
• Zillow Group Data Lake (AWS: S3 / Kinesis)
• User Profiles (Spark / HBase)
• Wedge Counting Collaborative Filtering (Spark)

Architectural Patterns
[Diagram: data flow from collection to serving]
Stages:
• Transport [Collect]
• Data Lake (AWS) [Store]
• Application (Backstage) [Process]
• Serving System / Analytics [Answer]
Components: Kinesis Firehose, Kinesis Stream, ZG Data Lake (S3), Application (batch), Application (near realtime), Database (Serving), Real-Time Scoring, Analytics
Operations shown on the arrows: Put object, Get object, Get records, Put NRT records

Machine Learning Models
• Random Forest
• K-means clustering
• Gradient Boosted Machines
• K-nearest neighbors
• CNN (Deep Learning)
• Wedge Counting
• NLP / TF-IDF / Word2vec / Bag of Words
• Linear Regression

Like vs. Dislike
Predict homes per user using the behavior of similar users.
[Diagram: graph linking users (Spencer, Stan) to properties ($19M, $22M, $664K); each edge is marked + (like), - (dislike), or ? (to be predicted)]
Like = user actively engaged with property
Dislike = user viewed property but weak engagement

Feature        Description
uid            unique id of user
pid            property id
first_visit    timestamp or 0
num_views      sigmoid(#views)
time_spent     time on page
num_contacts   # leads sent
num_saves      # saves on zpid
num_shares     # shares on zpid
num_photos     # photos viewed
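
The feature table on this slide can be illustrated with a small sketch. It assumes raw per-(user, property) counters arrive as a dictionary; the raw key names (views, seconds_on_page, and so on) are hypothetical, and only the output feature names and the sigmoid on the view count come from the slide.

```python
import math


def sigmoid(x):
    """Squash a raw count into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def engagement_features(uid, pid, raw):
    """Build one feature row for a (user, property) pair.

    `raw` holds hypothetical raw counters; missing counters default to 0.
    """
    return {
        "uid": uid,
        "pid": pid,
        "first_visit": raw.get("first_visit", 0),   # timestamp or 0
        "num_views": sigmoid(raw.get("views", 0)),  # sigmoid(#views)
        "time_spent": raw.get("seconds_on_page", 0),
        "num_contacts": raw.get("leads_sent", 0),
        "num_saves": raw.get("saves", 0),
        "num_shares": raw.get("shares", 0),
        "num_photos": raw.get("photos_viewed", 0),
    }


row = engagement_features("Stan", "$664K", {"views": 3, "saves": 1})
```

The sigmoid keeps heavy viewers from dominating: the feature saturates toward 1 as views grow instead of increasing without bound.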

Wedge Count
For all user & property pairs to form a prediction, perform wedge count: http://www.jmlr.org/proceedings/papers/v18/kong12a.pdf
Does Stan like $19M?
[Diagram: the user/property graph (Spencer, Stan; $19M, $22M, $664K) with signed edges, next to a list of wedge types found for the pair]
Wedge #
• 3 (wedge03_cnt)
• 5 (wedge05_cnt)
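
A minimal sketch of the counting step, under stated assumptions: a wedge is taken to be the three-edge path user, other property, other user, target property, and the eight wedge types are the eight sign combinations of those three edges. The bit encoding of the type index and the example edge signs are assumptions made for illustration, not the deck's definition.

```python
from collections import defaultdict


def wedge_counts(edges, user, prop):
    """Count signed three-edge paths user - p2 - u2 - prop, by sign pattern.

    edges: iterable of (uid, pid, sign) with sign +1 (like) or -1 (dislike).
    Returns a list of 8 counts; index = 4*[s1 > 0] + 2*[s2 > 0] + [s3 > 0],
    which is an assumed encoding.
    """
    by_user = defaultdict(dict)  # uid -> {pid: sign}
    by_prop = defaultdict(dict)  # pid -> {uid: sign}
    for uid, pid, sign in edges:
        by_user[uid][pid] = sign
        by_prop[pid][uid] = sign

    counts = [0] * 8
    for p2, s1 in by_user[user].items():      # edge 1: user - p2
        if p2 == prop:
            continue
        for u2, s2 in by_prop[p2].items():    # edge 2: p2 - u2
            if u2 == user:
                continue
            s3 = by_user[u2].get(prop)        # edge 3: u2 - prop
            if s3 is None:
                continue
            counts[4 * (s1 > 0) + 2 * (s2 > 0) + (s3 > 0)] += 1
    return counts


# Hypothetical edges loosely modeled on the slide's example graph.
edges = [
    ("Spencer", "$19M", +1),
    ("Spencer", "$22M", +1),
    ("Stan", "$22M", -1),
    ("Stan", "$664K", +1),
]
counts = wedge_counts(edges, "Stan", "$19M")
```

With these edges the only path from Stan to the $19M home runs through the $22M home and Spencer, so exactly one wedge type receives a count.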

Gradient Boosting Classifier
• Normalize wedge counts for popular users / properties
• Prediction for all user / property pairs
Does Stan like the $19M home?
features (uid: Stan, pid: $19M):
• wedge00_cnt, wedge01_cnt, wedge02_cnt, wedge03_cnt, wedge04_cnt, wedge05_cnt, wedge06_cnt, wedge07_cnt
• wedge00_norm_cnt, wedge01_norm_cnt, wedge02_norm_cnt, wedge03_norm_cnt, wedge04_norm_cnt, wedge05_norm_cnt, wedge06_norm_cnt, wedge07_norm_cnt
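
The 16-column feature row can be assembled as in the sketch below. The slide does not say how counts are normalized; dividing by the product of the user's and the property's edge counts is an assumption chosen only to show the idea of damping popular users and properties. The resulting rows would then be fed to a gradient boosting classifier.

```python
def wedge_feature_row(counts, user_degree, prop_degree):
    """Turn 8 raw wedge counts into the 16 named features.

    Normalization by user_degree * prop_degree is an assumed scheme,
    not the one used in the talk.
    """
    scale = max(user_degree * prop_degree, 1)  # avoid division by zero
    row = {}
    for i, count in enumerate(counts):
        row[f"wedge{i:02d}_cnt"] = count
    for i, count in enumerate(counts):
        row[f"wedge{i:02d}_norm_cnt"] = count / scale
    return row


feature_row = wedge_feature_row([0, 0, 0, 1, 0, 0, 0, 0], user_degree=2, prop_degree=2)
```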

User Profile
• Signals: website, mobile app, and search queries
• Binary classification: labels (like/dislike) same as wedge count model
Training rows: pid | uid | features (0 or 1, see feature list below) | label (0 or 1)
User profile model determines preference scores, e.g. 0_bed: 0, 1_bed: 0.01, 2_bed: 0.8, 3_bed: 0.6

Features (categorical variables)
Bath       0_bath, 0.5_bath, 1_bath, 1.5_bath, 2.5_bath, 3_bath
Bed        0_bed, 1_bed, 2_bed, 3_bed, 4_bed, 5_bed
Price      100_125_price, 125_150_price, 150_175_price
Use Code   condo, single_family, farm_land
Zipcode    zip_98109
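
The categorical features above amount to a one-hot encoding of each property. A minimal sketch, assuming prices are bucketed in 25-unit bands expressed in thousands of dollars (the unit is a guess from names like 125_150_price); the function names are hypothetical.

```python
def price_bucket(price_k, width=25):
    """Map a price in thousands to a band name such as '125_150_price'."""
    low = (price_k // width) * width
    return f"{low}_{low + width}_price"


def property_features(beds, baths, price_k, use_code, zipcode):
    """One-hot encode a property: each active category maps to 1."""
    return {
        f"{beds}_bed": 1,
        f"{baths:g}_bath": 1,      # 1.5 -> '1.5_bath', 2.0 -> '2_bath'
        price_bucket(price_k): 1,
        use_code: 1,
        f"zip_{zipcode}": 1,
    }


encoded = property_features(beds=2, baths=1.5, price_k=130,
                            use_code="condo", zipcode="98109")
```

Categories that do not apply are simply absent from the dictionary, which is equivalent to a 0 in the corresponding column.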

Ranking
• Property matrix feature space same as user profile
• Dot product of property matrix with user profile vector
• Linear regression with additional features (e.g. age decay)
[Figure: property matrix with rows pid_0 to pid_3 and columns 0_bed, 1_bed, 2_bed, 3_bed, multiplied by the uid_0 profile vector to give one score per property]
Example output, (uid, pid) and score: {"uId": "10307499", "pId": "1044183744"} 0.3364
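
The dot-product step can be sketched as follows. The profile values are the bedroom preference scores from the user profile slide; the assignment of one bedroom count to each of pid_0 through pid_3 is an assumption for illustration, and the later linear regression with age decay is not modeled here.

```python
def score(profile, features):
    """Dot product of a sparse property vector with the user profile."""
    return sum(value * profile.get(name, 0.0) for name, value in features.items())


def rank(profile, properties):
    """Return (pid, score) pairs sorted from best to worst match."""
    scored = [(pid, score(profile, feats)) for pid, feats in properties.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


profile = {"0_bed": 0.0, "1_bed": 0.01, "2_bed": 0.8, "3_bed": 0.6}
properties = {            # assumed one-hot rows
    "pid_0": {"0_bed": 1},
    "pid_1": {"1_bed": 1},
    "pid_2": {"2_bed": 1},
    "pid_3": {"3_bed": 1},
}
ranking = rank(profile, properties)
```

Because the property rows share the profile's feature space, adding bath, price, use code, or ZIP features needs no change to the scoring function.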

Machine Learning Pipeline
Collect user behavior and real-estate data, train the various models, generate the candidate set, and make predictions.
[Diagram: pipeline components]
• Event API (Java), User Behavior (Kinesis / S3)
• Producer (Python), Public Record (Kinesis / S3)
• Producer (Python), Active Listings (Kinesis / S3)
• Spark job creates Hive table with user events (uid, pid) partitioned by date
• User Store: past and current user events
• Property Data, Listing Data
• Training Data (Spark), Training Set (S3), Train Models (Spark)
• Scoring Data (Spark), Scoring Set (S3), Score (Spark)
• Wedge Counting / User Profile Models (Python)
• Wedge features or property features (user profile)
• Filter (Spark)
• Recommendations
• Hashmap (Redis): pid -> uid reverse index
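
The pid -> uid reverse index can be pictured with an in-memory stand-in. The sketch below uses a plain dictionary rather than Redis, and the use shown (finding which users to rescore when listings change) is one plausible reading of the diagram, not something the slide states.

```python
from collections import defaultdict


def build_reverse_index(events):
    """Map each pid to the set of uids that interacted with it.

    events: iterable of (uid, pid) pairs, as in the user events table.
    """
    index = defaultdict(set)
    for uid, pid in events:
        index[pid].add(uid)
    return index


def users_to_rescore(index, updated_pids):
    """Collect every user linked to at least one updated property."""
    users = set()
    for pid in updated_pids:
        users |= index.get(pid, set())
    return users


events = [("u1", "p1"), ("u2", "p1"), ("u2", "p2"), ("u3", "p3")]
index = build_reverse_index(events)
affected = users_to_rescore(index, ["p1", "p2", "p9"])
```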

Data Quality
Analytical pipelines that measure
• Data integrity
• Attributes / outlier detection
• Missing data
• Expected # of records
• Latency
• Models - expected data
Build reports / alerts that drive action
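
Three of the checks listed above (missing data, expected number of records, outlier detection) can be sketched in a few lines. The function names, the 10% tolerance, and the z-score threshold of 3 are hypothetical defaults, not values from the talk.

```python
from statistics import mean, pstdev


def missing_rate(records, field):
    """Fraction of records where `field` is absent or None."""
    if not records:
        return 1.0
    missing = sum(1 for record in records if record.get(field) is None)
    return missing / len(records)


def count_within_tolerance(actual, expected, tolerance=0.1):
    """True when the record count is within +/- tolerance of the expected count."""
    return abs(actual - expected) <= tolerance * expected


def outliers(values, z_threshold=3.0):
    """Values more than z_threshold standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > z_threshold]


records = [{"price": 100}, {"price": None}, {"price": 120}, {}]
rate = missing_rate(records, "price")
suspicious = outliers([10] * 20 + [1000])
```

A report or alert would compare each metric against a threshold and notify the owning team when it is exceeded.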

Free Zillow Data
Zillow.com/data
Time Series: national, state, metro, county, city and ZIP code levels

Zillow Home Value Index (ZHVI)
• Top / Middle / Bottom Thirds
• Single Family / Condo / Co-op
• Median Home Value Per Sq Ft

Zillow Rent Index (ZRI)
• Multi-family / SFR / Condo / Co-op
• Median ZRI Per Sq Ft
• Median Rent List Price

Other Metrics
• Median List Price
• Price-to-Rent ratio
• Homes Foreclosed
• For-sale Inventory / Age Inventory
• Negative Equity
• And many more…

ZTRAX: Zillow Transaction and Assessment Dataset
Previously inaccessible or prohibitively expensive housing data for academic and institutional researchers FOR FREE.
• More than 100 gigabytes
• 374 million detailed public records across more than 2,750 U.S. counties
• 20+ years of deed transfers, mortgages, foreclosures, auctions, property tax delinquencies and more for residential and commercial properties
• Assessor data including property characteristics, geographic information, and prior valuations on approximately 200 million parcels in more than 3,100 counties
Email ZTRAX@zillow.com for more information

Thank you!

Related Blogs
● Zillow.com/data-science
● Trulia.com/blog/tech/

Hiring
● Machine Learning Engineer
● Data Scientist
● Product Manager
● Data Engineer