Predicting Students Drop Out: A Case Study
Gerben Dekker, Mykola Pechenizkiy and Jan Vleeshouwers
The Case Study
• Educational Data Mining (EDM) in a practical setting
• Aimed at supporting a student advice procedure
• Eindhoven University of Technology, Electrical Engineering department
The Case Study: advice procedure
[Figure: timeline from September to January showing exam periods and a holiday. Inputs to the advice: pre-university student information, exam results, talks with students, etc. Other labels in the figure: ADVICE DEADLINE, 30%, 70%.]
July 2009
Outline
• CRISP-DM Framework
  − Understanding of context
  − Data understanding
  − Data preparation
  − Modeling
  − Evaluation
  − Deployment
• Conclusions and further work
CRISP-DM Framework
• Understanding of context
• Data understanding
• Data preparation
• Modeling
• Evaluation
• Deployment
Understanding of context
• Situation at Electrical Engineering, Eindhoven University of Technology
  • 40% dropout rate, small inflow
  • The decision to drop out is preferably made before the end of January
  • Study advice is given by the student counselor
• Objective for the department:
  • More robust and objective advice
Understanding of context
• In data mining terms:
  • Build a model of a student's academic success
  • Based on the information currently available
  • Only information up to December of the year of enrollment
• Objective for research:
  • Test the applicability of EDM in this context:
    − Enough data (amount)?
    − Enough data (type)?
Data understanding
• Data source
  • Institution's database
    − Pre-university data
    − University data
• Resulting data
  • Data from 648 students, from 2001-2009
Data preparation (pre-university data)
• Standard preparatory education:
  • Number of courses
  • Type of courses taken
  • Average grades for total, science, and math
• Non-standard previous education:
  • Type
  • Grade
Data preparation (university data)
• Courses, grades, number of attempts
• Many transformations needed:
  • Reorganizations
  • Partial exams
• Example: Calculus
  • 2000-2001: 1 examination
  • 2001-2006: 2 partial examinations
  • 2007-2008: 5 partial examinations, or 1 examination
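The slide does not say how partial examinations were merged into one comparable course grade. A minimal sketch of one possible transformation, assuming an equally weighted mean of the partial grades and assuming a full examination takes precedence when both exist; the function name `combined_grade` and both rules are illustrative assumptions, not the authors' method:

```python
from statistics import mean
from typing import Optional, Sequence


def combined_grade(full_exam: Optional[float],
                   partials: Sequence[float]) -> Optional[float]:
    """Reduce one student's results for a course to a single grade.

    Assumptions (not from the slides): a full examination grade, when
    present, is used as-is; otherwise the partial grades are averaged
    with equal weights. Returns None when no result is recorded.
    """
    if full_exam is not None:
        return full_exam
    if partials:
        return round(mean(partials), 2)
    return None
```

With this rule, a 2001-2006 student with two partial grades and a 2000-2001 student with one examination grade both end up with a single Calculus value that can be used as one attribute.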
Modeling (general)
• Classification task
  • Two-class classification
  • Criterion: finish all first-year courses within three years
• Several mining techniques applied
  • Decision trees (and ensembles), Bayesian classifiers, association rules
• University and pre-university data modeled separately first
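The class criterion above can be written down as a labeling rule. A rough sketch, assuming results are recorded per academic year and that "within three years" means the enrollment year plus the two following academic years; the function name, the data layout, and the course names in the usage note are hypothetical:

```python
from typing import Dict, Iterable


def label_student(enrollment_year: int,
                  passed_in_year: Dict[str, int],
                  first_year_courses: Iterable[str],
                  max_years: int = 3) -> int:
    """Return 1 if every first-year course was passed in time, else 0.

    passed_in_year maps a course name to the academic year (given by
    its starting calendar year) in which the student passed it; courses
    never passed are simply absent. Assumption: a course counts as "in
    time" when passed in one of the first `max_years` academic years.
    """
    return int(all(
        course in passed_in_year
        and passed_in_year[course] - enrollment_year < max_years
        for course in first_year_courses
    ))
```

For example, a student enrolled in 2001 who passes "Calculus" in 2001 and "Linear Algebra" in 2003 gets label 1, while passing the second course in 2004 gives label 0.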
Modeling (pre-university data)
• Baseline model
  • One-rule classifier
  • 68% accuracy using Science_mean
• No significant improvement using other classification techniques
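For readers unfamiliar with the baseline, a one-rule classifier picks the single attribute whose value-to-majority-class mapping makes the fewest training errors. A minimal sketch for categorical attributes; it is not the tool used in the study, numeric attributes such as Science_mean would first need to be discretized, and the attribute names in the usage note are made up:

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def one_rule(rows: Sequence[Dict[str, Hashable]],
             labels: Sequence[Hashable]) -> Tuple[str, Dict[Hashable, Hashable], float]:
    """Return (best attribute, value -> class rule, training accuracy)."""
    best = None
    for attr in rows[0]:
        # Count class frequencies for every value of this attribute.
        counts = defaultdict(Counter)
        for row, y in zip(rows, labels):
            counts[row[attr]][y] += 1
        # The rule predicts the majority class for each attribute value.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        correct = sum(rule[row[attr]] == y for row, y in zip(rows, labels))
        accuracy = correct / len(rows)
        if best is None or accuracy > best[2]:
            best = (attr, rule, accuracy)
    return best
```

Called on a handful of rows with attributes such as `science_mean` and `n_courses`, it returns the attribute that best separates the two classes together with the rule it induces.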
Modeling (university data)
• Baseline model
  • One-rule classifier
  • 75% accuracy using Linear Algebra AB
• Significant improvement using other models (80% accuracy)
• Decision trees slightly better than other models
Modeling (total set)
• Accuracies of 80%, using attributes from both subsets
• Improvements using cost matrices
  • Shape the distribution of misclassifications
  • Small trade-offs between accuracy and misclassification:
    • Accuracy 79%, 52% of errors are false positives (FP)
    • Accuracy 76%, 41% of errors are false positives (FP)
• Similarities between models
  • Linear Algebra AB is always the root node
  • Science Mean is always high in the tree
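The trade-off on this slide can be illustrated with a small sketch of cost-sensitive decisions: raising the cost of a false positive moves the decision away from predicting success, which lowers the FP share of the errors at some cost in accuracy. The cost values, the convention that class 1 is the positive class, and both function names are assumptions for illustration, not the settings used in the study:

```python
from typing import Tuple


def min_cost_class(p_positive: float, cost_fp: float, cost_fn: float) -> int:
    """Pick the class with the lower expected misclassification cost.

    p_positive is the model's estimated probability of class 1.
    cost_fp: cost of predicting 1 when the truth is 0.
    cost_fn: cost of predicting 0 when the truth is 1.
    """
    expected_cost_predict_1 = (1 - p_positive) * cost_fp
    expected_cost_predict_0 = p_positive * cost_fn
    return 1 if expected_cost_predict_1 <= expected_cost_predict_0 else 0


def accuracy_and_fp_share(tp: int, fp: int, tn: int, fn: int) -> Tuple[float, float]:
    """Return (accuracy, fraction of all errors that are false positives)."""
    errors = fp + fn
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    fp_share = fp / errors if errors else 0.0
    return accuracy, fp_share
```

With equal costs a student at probability 0.6 is classified as 1; doubling `cost_fp` flips that decision to 0, which is the mechanism behind the shift in FP share reported above.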
Modeling (decision tree)
79% accuracy
Lin. Alg. AB
  > 5.5: 1
  < 5.5: Calc. A
    > 5.15: 1
    < 5.15: VWO_Sc_mean
      {good, excellent}: 1
      {n/a, poor, avg, above avg}: 0
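One reading of the tree on this slide, written as a function. The nesting of the branches is reconstructed from the flattened slide, the handling of grades exactly on a threshold is not specified in the source, and the function and parameter names are hypothetical:

```python
def predict_class(lin_alg_ab: float, calc_a: float, vwo_sc_mean: str) -> int:
    """Apply the three-level tree and return the predicted class (1 or 0).

    Assumptions: grades exactly equal to a threshold fall into the
    lower branch, and vwo_sc_mean is one of the categories
    "n/a", "poor", "avg", "above avg", "good", "excellent".
    """
    if lin_alg_ab > 5.5:        # root node: Linear Algebra AB
        return 1
    if calc_a > 5.15:           # second level: Calculus A
        return 1
    # Third level: categorized mean of pre-university science grades.
    return 1 if vwo_sc_mean in {"good", "excellent"} else 0
```

Under this reading, only students who score low on both university courses and do not have a good or excellent pre-university science mean are assigned class 0.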
Evaluation
• Detailed manual analysis by the student counselor:
  • Review the classification measure:
    − 25% of the false negatives should be true negatives
    − How to classify skilled people who leave?
  • Improve data transformations
Deployment
• Objectives
  • More robust and objective advice:
    − 80% accuracy is possible, with clear directions for improvement.
  • Test the applicability of EDM in this context:
    − Enough data (amount)? Yes, and more is not easily obtainable.
    − Enough data (type)? More types would probably be very useful, but costly.
• Deployment possible after improvements
Conclusions and further work
• EDM can help in a study advice process: 80% accuracy is possible, with clear directions for improvement.
• EDM can work with small datasets and a limited number of data categories
• Further work:
  • Improve data transformations
  • Improve the classification measure: better two-class, move to three-class
  • Review the use of additional data
Questions?