KDD Cup 2004 Winning Model for Task 1

KDD Cup 2004 Winning Model for Task 1: Particle Physics Prediction
David S. Vogel: MEDai / AI Insight, University of Central Florida
Eric Gottschalk: MEDai / AI Insight
Morgan C. Wang: University of Central Florida
Orlando, FL

What did we know?
  • Given 12 million numbers.
  • No information given about what these numbers represent.
  • No knowledge of particle physics.
  • Predict 100,000 ones and zeros.

Unsuccessful Modeling Packages
  • Software #1: Tree-based boosting algorithms
  • Software #2: Logistic Regression and Neural Networks
  • Software #3: Support Vector Machines
  • Software #4: Rule-finding algorithms

Key Modeling Tools
  • MITCH (Multiple Intelligent Tasking Computer Heuristics): used for its visualizations, variable analysis, transformations, Neural Networks, and scoring tools.
  • NICA (Numerical Interaction CAlibrator): used to detect interactions within the data.

Category Analysis
  Values of Variable #63    N              # Class 0    # Class 1
  {-8, -2, 1, 14}           2350 (4.7%)    0            2350
  {8, 2, -14}               2294 (4.6%)    2294         0
Nearly one tenth of records are 100% predictive.
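A category analysis of this kind can be sketched in a few lines of Python. This is only an illustration, not the MITCH tooling: the function name `find_pure_values` and the toy data are hypothetical, and the sketch simply counts class labels per distinct value and reports the values whose records all share one class.

```python
from collections import defaultdict

def find_pure_values(values, labels, min_count=1):
    """Return {value: class} for values whose records all belong to one class.

    values: the nominal variable, one entry per record.
    labels: the 0/1 target, one entry per record.
    min_count: ignore values seen fewer times than this (guards against
               tiny groups that look pure by chance).
    """
    counts = defaultdict(lambda: [0, 0])  # value -> [n_class0, n_class1]
    for v, y in zip(values, labels):
        counts[v][y] += 1
    pure = {}
    for v, (n0, n1) in counts.items():
        if n0 + n1 >= min_count and (n0 == 0 or n1 == 0):
            pure[v] = 1 if n0 == 0 else 0
    return pure

# Toy data (made up): -8 is always class 1, 8 is always class 0, 3 is mixed.
toy_values = [-8, -8, 8, 8, 3, 3, 3]
toy_labels = [1, 1, 0, 0, 0, 1, 1]
pure_values = find_pure_values(toy_values, toy_labels)
```

On real data a larger `min_count` would be a sensible precaution, since a value that occurs only a handful of times can appear 100% predictive by accident.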

Investigation of Variables
  • Group 1: 8 variables with values {-1, 0, 1}. Interactive and symmetric.
  • Group 2: A key nominal variable.
  • Group 3: 6 individually predictive variables.
  • Group 4: All other variables, no correlation to the dependent variable.

Complete Interaction Search
  Variable 1    Variable 2    Z-Score
  V01           V04           25.9
  V65           V66           24.79
  V01           V66           18.51
  V04           V78           18.27
  V05           V08           16.88
  V04           V76           16.71
  V23           V66           16.66
  V19           V66           16.46
  :             :             :
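The slides do not say how NICA computes its Z-scores, so the following is only a sketch of one plausible exhaustive pairwise search. It assumes an interaction can be represented by the product of two variables and scores each pair with a Fisher z-statistic for the correlation between that product and the target. The function names and the toy columns A, B, C are hypothetical.

```python
import math
import random
from itertools import combinations

def pearson(x, y):
    """Plain Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def interaction_z(x1, x2, y):
    """Assumed score: Fisher z of corr(x1 * x2, y), scaled by sqrt(n - 3)."""
    product = [a * b for a, b in zip(x1, x2)]
    r = pearson(product, y)
    r = max(min(r, 0.999999), -0.999999)  # keep atanh finite
    return math.atanh(r) * math.sqrt(len(y) - 3)

def rank_interactions(columns, y):
    """Score every pair of columns and sort by absolute z, strongest first."""
    scores = [
        (name1, name2, abs(interaction_z(col1, col2, y)))
        for (name1, col1), (name2, col2) in combinations(columns.items(), 2)
    ]
    return sorted(scores, key=lambda t: t[2], reverse=True)

# Toy data (made up): the target depends only on the sign of A * B.
rng = random.Random(0)
columns = {name: [rng.choice([-1, 1]) for _ in range(200)] for name in "ABC"}
y = [1 if a * b == 1 else 0 for a, b in zip(columns["A"], columns["B"])]
ranked = rank_interactions(columns, y)
```

In the toy data neither A nor B is predictive alone, yet the pair ranks first, which is the kind of effect an exhaustive pairwise search is meant to surface.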

Predictor V01: r = 0.006 [plot: Class 1 probability against V01]

Predictor V01 where V04 = 1 [plot: Class 1 probability against V01]

Predictor V01 where V04 = -1 [plot: Class 1 probability against V01]

Predictor V04*(V01 - 0.75): r = 0.23 [plot: Class 1 probability against V01]
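The four plots above can be mimicked with synthetic data. The sketch below does not use the competition data: it assumes a made-up generating rule in which the slope of the class 1 probability in V01 flips sign with V04, so V01 alone is nearly uncorrelated with the target while the combined predictor V04*(V01 - 0.75) is clearly correlated. All distributions and constants other than the 0.75 offset are assumptions.

```python
import math
import random

def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(42)
n = 5000
v04 = [rng.choice([-1, 1]) for _ in range(n)]    # symmetric sign variable
v01 = [rng.uniform(0.0, 1.5) for _ in range(n)]  # assumed range, centred on 0.75

# Hypothetical rule: P(class 1) rises with V01 when V04 = 1 and falls when V04 = -1.
prob = [0.5 + 0.4 * s * (v - 0.75) / 0.75 for s, v in zip(v04, v01)]
y = [1 if rng.random() < p else 0 for p in prob]

# The two opposite trends cancel for V01 alone but line up once V04 sets the sign.
combined = [s * (v - 0.75) for s, v in zip(v04, v01)]
r_raw = pearson(v01, y)
r_combined = pearson(combined, y)
```

Subtracting 0.75 before multiplying matters: it centres V01 at the point where the two trend lines cross, so multiplying by V04 mirrors one trend onto the other rather than merely flipping its sign.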

Interactions between variables [figure with color legend]
  • Red: Extremely Strong
  • Green: Strong
  • Yellow: Moderate (p < .01)

Details of 639 Predictors
  • Majority of original variables (after null value replacement)
  • 100% predictive groups
  • High volume categories of the nominal variable
  • 2 variables indicating null values
  • 72 first order interactions
  • 185 second order interactions
  • 301 third order interactions

Model Details
  • 40,000 training cases
  • 10,000 validation cases
  • MITCH Self-Organizing Neural Network
  • “Bernoulli” function optimization generally performed best
  • Generalized extremely well on the validation set, considering the number of variables
  • Small secondary model based on residuals

Customization
  • Severe penalty for incorrect probabilities of 0 or 1: a “googol”!!!
  • “Gimmees” forced to be at 0.995 or 0.005.
  • Accept 9300 tiny penalties to avoid risking “disaster.”
  • 14 teams had a “disaster.”
  • Remaining predictions truncated at 0.01 and 0.99 to compensate for over-fitting at extremes.
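The trade-off behind these thresholds can be illustrated with a short sketch. The slides do not name the scoring function, so the code assumes a logarithmic penalty, which is consistent with an unbounded cost for a wrong prediction of exactly 0 or 1. The function names are hypothetical; the thresholds 0.995, 0.005, 0.01, and 0.99 come from the slide.

```python
import math

def finalize_prediction(p, forced_class=None):
    """Clamp one predicted probability of class 1.

    forced_class: 1 or 0 for a record in a 100%-predictive group
                  (a "gimmee"), or None for an ordinary record.
    """
    if forced_class == 1:
        return 0.995
    if forced_class == 0:
        return 0.005
    return min(max(p, 0.01), 0.99)

def log_penalty(p, actual):
    """Assumed penalty: negative log of the probability given to the true class."""
    q = p if actual == 1 else 1.0 - p
    return math.inf if q <= 0.0 else -math.log(q)

gimmee = finalize_prediction(1.0, forced_class=1)  # 0.995 instead of 1.0
ordinary = finalize_prediction(0.9999)             # truncated to 0.99

cost_if_right = log_penalty(gimmee, 1)    # the "tiny penalty", about 0.005
cost_if_wrong = log_penalty(gimmee, 0)    # bounded, about 5.3
cost_unhedged_wrong = log_penalty(1.0, 0) # the "disaster": infinite
```

Under this assumed penalty, each hedged gimmee costs roughly 0.005 when it is right, which is the small price paid on every such record to keep a single surprise from producing an unbounded score.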

Customization (continued)
  • Q-Score predictions were maximized by retraining with a “creative” optimization function: (Predicted - Actual)^6.
  • Predictions re-calibrated using the function: [formula missing from this transcript]
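The re-calibration function does not survive in this transcript, so the sketch below covers only the sixth-power objective, (Predicted - Actual)^6. It assumes the objective is averaged over records and shows its gradient with respect to each prediction. The function names and the toy values are hypothetical, and how this objective relates to the Q-Score is not explained in the slides.

```python
def power_loss(predicted, actual, power=6):
    """Mean of (predicted - actual) ** power over all records."""
    return sum((p - a) ** power for p, a in zip(predicted, actual)) / len(actual)

def power_loss_gradient(predicted, actual, power=6):
    """Gradient of power_loss with respect to each prediction."""
    n = len(actual)
    return [power * (p - a) ** (power - 1) / n for p, a in zip(predicted, actual)]

# Toy values (made up): two small errors of 0.1 and one large error of 0.6.
actual = [1, 0, 1]
predicted = [0.9, 0.1, 0.4]

loss6 = power_loss(predicted, actual)           # dominated by the 0.6 error
loss2 = power_loss(predicted, actual, power=2)  # squared error, for comparison
grad6 = power_loss_gradient(predicted, actual)
```

Compared with squared error, the sixth power all but ignores small errors and concentrates the gradient on the few records that are badly wrong, so retraining under it mainly pulls in the worst predictions.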

Where do we go from here?
  • Accuracy -- independent of content
  • Scientific & Industry Applications

Questions?
