ACM KDD Cup: A Survey, 1997-2011

Qiang Yang 杨强 (partly based on Xinyue Liu's slides @SFU and Nathan Liu's slides @HKUST)
Hong Kong University of Science and Technology 香港科大

About KDD Cup (1997-2011)
• Competition is a strong mover for science and engineering:
  • ACM Programming Contest: world college-level programming skills
  • RoboCup: world robotics competition

About ACM KDDCUP
• ACM KDD: premier conference in knowledge discovery and data mining
• ACM KDDCUP: worldwide competition held in conjunction with the ACM KDD conferences. It aims at:
  • Showcasing the best methods for discovering higher-level knowledge from data
  • Helping to close the gap between research and industry
  • Stimulating further KDD research and development

Statistics
• Participation in KDD Cup grew steadily
• Average person-hours per submission: 204
• Max person-hours per submission: 910

Year    Submissions
1997    16
1998    21
1999    24
2000    30
2005    32
2011    1000+

Algorithms (up to 2000)

KDD Cup 97
• A classification task: predict direct mail response in the financial services industry
• Winners:
  • Charles Elkan, a professor at UC San Diego, with his Boosted Naive Bayesian (BNB) classifier
  • Silicon Graphics, Inc., with their software MineSet
  • Urban Science Applications, Inc., with their software gain, Direct Marketing Selection System

MineSet (Silicon Graphics Inc.)
• A KDD tool that combines data access, transformation, classification, and visualization

KDD Cup 98: CRM Benchmark
• URL: www.kdnuggets.com/meetings/kdd98/kdd-cup-98.html
• A classification task: analyze responses to fund-raising mail sent by a non-profit organization
• Winners:
  • Urban Science Applications, Inc., with their software GainSmarts
  • SAS Institute, Inc., with their software SAS Enterprise Miner™
  • Quadstone Limited, with their software Decisionhouse™

KDDCUP 1998 Results
• Maximum possible profit line: $72,776 in profits with 4,873 mailed
• Mail-to-everyone solution: $10,560 in profits with 96,367 mailed
• Entries compared: GainSmarts, SAS/Enterprise Miner, Quadstone/Decisionhouse
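The two reference lines on this slide can be reconciled with a little arithmetic. The sketch below assumes a mailing cost of $0.68 per piece, which is not stated on the slide. Under that assumption, the total donations implied by the maximum-profit line reproduce the mail-to-everyone figure. The function name `campaign_profit` and the toy data are illustrative only.

```python
MAIL_COST = 0.68  # assumed cost per mailed piece in USD; not stated on the slide


def campaign_profit(donations, mailed, cost=MAIL_COST):
    """Net profit of a mailing policy.

    donations[i] is what person i would give if mailed (0 for non-donors);
    mailed[i] says whether the policy mails person i.
    """
    return sum(d - cost for d, m in zip(donations, mailed) if m)


# Consistency check against the two reference lines on the slide.
max_profit, n_donors = 72_776, 4_873      # mail only the actual donors
n_everyone = 96_367                       # mail the whole test population

# Mailing only donors: profit = total donations - cost * number of donors.
total_donations = max_profit + MAIL_COST * n_donors
mail_everyone_profit = total_donations - MAIL_COST * n_everyone
print(round(mail_everyone_profit))  # 10560, matching the slide

# Toy policy: four people, three of them mailed.
toy_profit = campaign_profit([0, 5.0, 0, 20.0], [True, True, False, True])
```

A model only beats the mail-to-everyone baseline if it skips enough non-donors to save more in mailing cost than it loses in missed donations.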

ACM KDD Cup 1999
• URL: www.cse.ucsd.edu/users/elkan/kdresults.html
• Problem: detect network intrusion and protect a computer network from unauthorized users, including perhaps insiders
• Data: from DoD
• Winners:
  • SAS Institute Inc., with their software Enterprise Miner
  • Amdocs, with their Information Analysis Environment

KDDCUP 2000: Data Set and Goal
• Data: collected from Gazelle.com, a legwear and legcare Web retailer
  • Pre-processed
  • Training set: 2 months
  • Test sets: one month
  • Data collected includes click streams and order information
• The goal: design models to support website personalization and to improve the profitability of the site by increasing customer response
• Questions: when given a set of page views,
  • characterize heavy spenders
  • characterize killer pages
  • characterize which product brand a visitor will view in the remainder of the session

KDDCUP 2000: The Winners
• Questions 1 & 5 winner: Amdocs
• Questions 2 & 3 winner: Salford Systems
• Question 4 winner: e-steam

KDD Cup 2001
• 3 bioinformatics tasks
• Dataset 1: prediction of molecular bioactivity for drug design
  • Half a gigabyte when uncompressed
• Dataset 2: prediction of gene/protein function (task 2) and localization (task 3)
  • Smaller and easier to understand
  • 7 megabytes uncompressed
• A total of 136 groups participated, producing 200 submitted predictions over the 3 tasks: 114 for Thrombin, 41 for Function, and 45 for Localization

2001 Winners
• Task 1, Thrombin: Jie Cheng (Canadian Imperial Bank of Commerce)
  • Bayesian network learner and classifier
• Task 2, Function: Mark-A. Krogel (University of Magdeburg)
  • Data: the genes of one particular type of organism
  • A gene/protein can have more than one function, but only one localization
  • Inductive logic programming
• Task 3, Localization: Hisashi Hayashi, Jun Sese, and Shinichi Morishita (University of Tokyo)
  • K nearest neighbor

KDD Cup 2002
• Molecular biology: two tasks
  • Task 1: document extraction from biological articles
  • Task 2: classification of proteins based on gene deletion experiments
• Winners:
  • Task 1: ClearForest and Celera, USA (Yizhar Regev and Michal Finkelstein)
  • Task 2: Telstra Research Laboratories, Australia (Adam Kowalczyk and Bhavani Raskutti)

2003 KDDCUP
• Information retrieval/citation mining of scientific research papers, based on a very large archive of research papers
  • First task: predict how many citations each paper will receive during the three months leading up to the KDD 2003 conference
  • Second task: construct a citation graph of a large subset of the archive from only the LaTeX sources
  • Third task: estimate each paper's popularity based on partial download logs
  • Last task: participants devise their own questions

2003 KDDCUP: Results (1st place)
• Task 1: Claudia Perlich, Foster Provost, Sofus Macskassy (New York University)
• Task 2: David Vogel (AI Insight Inc.)
• Task 3: Janez Brank and Jure Leskovec (Jozef Stefan Institute, Slovenia)
• Task 4: Amy McGovern, Lisa Friedland, Michael Hay, Brian Gallagher, Andrew Fast, Jennifer Neville, and David Jensen (University of Massachusetts Amherst, USA)

2004 Tasks and Results
• Tasks: particle physics, plus protein homology prediction
• The winners of the two subtasks were, respectively:
  • David S. Vogel, Eric Gottschalk, and Morgan C. Wang
  • Bernhard Pfahringer, Yan Fu (付岩), RuiXiang Sun, Qiang Yang (杨强), Simin He, Chunli Wang, Haipeng Wang, Shiguang Shan, Junfa Liu, Wen Gao

Past KDDCUP Overview: 2005-2010

Year | Host | Task | Technique | Winner
2005 | Microsoft | Web query categorization | Feature engineering, Ensemble | HKUST (Dou Shen (沈抖), Qiang Yang (杨强), et al.)
2006 | Siemens | Pulmonary emboli detection | Multi-instance, Non-IID sample, Cost sensitive, Class imbalance, Noisy data | AT&T, Budapest University of Technology & Economics
2007 | Netflix | Consumer recommendation | Collaborative filtering, Time series, Ensemble | IBM Research, Hungarian Academy of Sciences
2008 | Siemens | Breast cancer detection from medical images | Ensemble, Class imbalance, Score calibration | IBM Research, National Taiwan University
2009 | Orange | Customer relationship prediction in telecom | Feature selection, Ensemble | IBM Research, University of Melbourne
2010 | PSLC DataShop | Student performance prediction in e-learning | Feature engineering, Ensemble, Collaborative filtering | National Taiwan University (CJ Lin, S. Lin, etc.)

KDDCUP'11 Dataset
• 11 years of data
• Rated items are:
  • Tracks
  • Albums
  • Artists
  • Genres
• Items are arranged in a taxonomy
• Two tasks

         | Track 1 | Track 2
#ratings | 263M    |
#items   | 625K    | 296K
#users   | 1M      | 249K

Items in a Taxonomy

Track 1 Details

Track 1 Highlights
• Largest publicly available dataset
• Large number of items (50 times more than Netflix)
• Extreme rating sparsity (20 times more sparse than Netflix)
• Taxonomy can help in combating sparsely rated items
• Fine time stamps with both date and time allow sophisticated temporal modeling

Track 2 Details

Track 2 Highlights
• The performance metric focuses on ranking/classification, which differs from traditional collaborative filtering
• No validation data is provided; participants need to self-construct binary labeled data from the rating data
• Unlike Track 1, Track 2 removed time stamps to focus more on long-term preference than on short-term behavior
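Self-constructing binary labeled data from ratings might look like the sketch below. It is one plausible reading of the slide, not the organizers' or any winner's procedure. The assumptions are that a rating at or above a threshold (here 80, a hypothetical value) counts as a positive, and that each positive is paired with one negative drawn from items the user never rated, sampled in proportion to how often each item is rated highly. The names `build_binary_examples` and `toy_ratings` are illustrative.

```python
import random
from collections import Counter


def build_binary_examples(ratings, like_threshold=80, max_tries=20, seed=0):
    """Turn (user, item, score) ratings into (user, item, label) examples.

    label 1: the user rated the item at or above like_threshold (assumed cutoff).
    label 0: an item the user never rated, sampled by popularity.
    """
    rng = random.Random(seed)
    rated_by_user = {}
    popularity = Counter()
    for user, item, score in ratings:
        rated_by_user.setdefault(user, set()).add(item)
        if score >= like_threshold:
            popularity[item] += 1

    # Candidate negatives are drawn with probability proportional to popularity.
    items = sorted(popularity)
    weights = [popularity[item] for item in items]

    examples = []
    for user, item, score in ratings:
        if score < like_threshold:
            continue
        examples.append((user, item, 1))
        # Rejection sampling: retry until we hit an item this user never rated.
        for _ in range(max_tries):
            candidate = rng.choices(items, weights=weights)[0]
            if candidate not in rated_by_user[user]:
                examples.append((user, candidate, 0))
                break
    return examples


toy_ratings = [
    ("u1", "a", 90), ("u1", "b", 30),
    ("u2", "a", 85), ("u2", "c", 95),
    ("u3", "c", 100), ("u3", "d", 80),
]
examples = build_binary_examples(toy_ratings)
```

Sampling negatives by popularity keeps a classifier from separating the classes on item popularity alone, which is why it pairs naturally with the negative-instance sampling listed among the Track 2 techniques.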

Submission Stats

Winners

Place     | Track 1 | Track 2
1st place | National Taiwan University | National Taiwan University
2nd place | Commendo (Netflix Prize winner) | Chinese Academy of Sciences, Hulu Labs
3rd place | Hong Kong University of Science and Technology, Shanghai Jiaotong University | Commendo (Netflix Prize winner)

Chinese Teams at KDDCUP (NTU, CAS, HKUST)

Key Techniques
• Track 1:
  • Blending of multiple techniques
  • Matrix factorization models
  • Nearest neighbor models
  • Restricted Boltzmann machines
  • Temporal modeling
• Track 2:
  • Importance sampling of negative instances
  • Taxonomical modeling
  • Use of pairwise ranking objective functions
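As a concrete reference point for the matrix factorization models listed under Track 1, here is a minimal sketch of a plain factorization trained with stochastic gradient descent. It is a textbook baseline rather than any winning team's model: it uses a global mean plus a dot product of user and item factors, with no biases, taxonomy, or temporal terms. All names and hyperparameters (`train_mf`, `lr`, `reg`, the toy ratings) are assumptions made for illustration.

```python
import random


def predict(mu, P, Q, u, i):
    """Predicted rating: global mean plus dot product of the two factor vectors."""
    return mu + sum(pu * qi for pu, qi in zip(P[u], Q[i]))


def train_mf(ratings, n_users, n_items, n_factors=2,
             lr=0.05, reg=0.02, epochs=300, seed=0):
    """Fit user factors P and item factors Q by SGD on squared error with L2 penalty."""
    rng = random.Random(seed)
    P = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_users)]
    Q = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_items)]
    mu = sum(r for _, _, r in ratings) / len(ratings)
    order = list(ratings)  # copy so the caller's list is not shuffled
    for _ in range(epochs):
        rng.shuffle(order)
        for u, i, r in order:
            err = r - predict(mu, P, Q, u, i)
            for f in range(n_factors):
                pu, qi = P[u][f], Q[i][f]  # update both from the old values
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return mu, P, Q


def rmse(ratings, mu, P, Q):
    """Root mean squared error of the model on a list of (user, item, rating)."""
    total = sum((r - predict(mu, P, Q, u, i)) ** 2 for u, i, r in ratings)
    return (total / len(ratings)) ** 0.5


# Toy data: (user index, item index, rating).
toy = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1), (2, 1, 2), (2, 2, 5)]
mu, P, Q = train_mf(toy, n_users=3, n_items=3)
train_rmse = rmse(toy, mu, P, Q)
print(f"training RMSE: {train_rmse:.3f}")
```

Blending, the first item on the Track 1 list, would then combine the predictions of several such models (for example with different `n_factors` or `reg`) together with neighborhood and other predictors.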

Summary
• Placing at the top of KDDCUP requires:
  • Teamwork
  • Expertise in domain knowledge as well as mathematical tools
  • Often done by world-famous institutes and companies
• Recent trends:
  • Datasets are increasingly realistic
  • Participants are increasingly professional
  • Tasks are increasingly difficult

Summary
• KDD Cup is an excellent source for learning state-of-the-art KDD techniques
• KDDCUP datasets often become the standard benchmarks for future research, development, and teaching
• Top winners are highly regarded and respected
