Data Blending and Predictive Analytics with KNIME Tips

Data Blending and Predictive Analytics with KNIME Tips and Tricks for Higher Education Dr. Paul Prewitt-Freilino Director of Institutional Research, Wheaton College Nathan Rush Institutional Research Analyst, Wheaton College

Institutional Research at Wheaton Two-person IR office, limited resources No data warehouse A decade of frozen (Excel) files for admissions, enrollment, registration, student sections, etc. Changes in file formatting over time Column names, column order, value formats Park Hall, Wheaton College Wheaton IR is increasing its capacity Requests for new analyses Instructional activity, campus climate survey, enrollment trends by instructor Move from consultants to in-house analytics for enrollment/retention modeling

Data Blending with KNIME Census “warehouse” – stacked file of censuses with data from additional sources Application information, graduation rates/degrees awarded, financial aid

Data Blending with KNIME List Files nodes and Loops to specify groups of files to be read, manipulated, and joined to other files Group. By to dedup, Numeric Binner to classify, and Joiner to combine files based on matched data in one or more fields Math Formula, Rule Engine, and String Manipulation to calculate, recode, and modify different types of data (Integer, Double, String)

Data Preparation with KNIME National Student Clearinghouse Results Create unduplicated lists of students filtered by any combination of most recent institution attended, institution with longest continuous enrollment, enrollment status, institution type, etc. IPEDS Reporting Connect census and graduation files with validation tables (if necessary) to calculate values (e. g. completions by CIP code, race/ethnicity, and gender) Use String Manipulation to create a fixed length file of values and formatting, to be saved as. txt for import/upload

Predictive Modeling with KNIME Developing a pre-enrollment model for predicting retention to Year 2 Analysis of first time, full-time cohorts Use of multiple models (Logistic Regression, Decision Tree, Tree Ensemble/Random Forests™) to predict retention Assessment of accuracy statistics (Scorer) and adjustment of prediction confidence threshold (Rule Engine) Write predictions and confidence values to Excel or directly to an SQL database (e. g. Banner)

Machine Learning with KNIME Process to create mathematical transformations of numeric variables Transformations and untransformed variable are correlated to DV (e. g. Year 2 retention) Only most correlated version is retained and passed to modeling nodes Most correlated transformation of a given IV may change when new data are added and workflow is reset

Future Directions with KNIME Connecting to Banner (Database nodes) Would eliminate need to run applicant extract from Banner and download through FTP each morning for admission dashboard Use of APIs (REST, Google Analytics, Twitter nodes) Find home values for modeling, analyze web site traffic. . . Connect to sites (e. g. Qualtrics) instead of downloading results or uploading files to refresh panels/contacts Data formatting for Tableau Simulating cube data, working with multiple delimiters via Cell Splitter nodes

Thank You wheatoncollege. edu/institutional-research ir@wheatoncollege. edu