Data Engineering for ML in Power BI: Leakage and No Predictive Value
George Sturrock

Background
• Microsoft continues to integrate its Azure cloud machine learning capabilities with Power BI.
  • Key Influencers
  • Auto ML in Power BI Dataflows
• Data engineering is a critical step in a machine learning pipeline.
  • Machine learning models require data to be prepared in specific ways in order to function and produce optimal results.
• Pipeline stages: Data Engineering → Model Training → Model Testing → Model Deployment

The Presentation Series
• This is the first in a multipart set of presentations that walk through some of the more important concepts in data engineering for machine learning within Power BI.
• Data from NASA’s Kepler satellite mission is used throughout the series.
• The end goal is to prepare data for a machine learning algorithm that will attempt to classify whether a Kepler Object of Interest is a planet.
• Presentation list
  • Leakage and No Predictive Value
  • Missing Values
  • Scaling Data
  • Categorical Data
  • Highly Correlated Data

What is Leakage?
• In statistics and machine learning, leakage occurs when information about the outcome, which would not be available at prediction time, is used to train a statistical model.
• Leakage features can be used to predict the outcome on their own.
• Kepler examples of leakage
  • koi_pdisposition
  • kepler_name
  • koi_score
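One rough way to screen for features that "predict the outcome on their own" is to score each feature by itself against the label. The sketch below is only an illustration, not the method used in this presentation: the label column `is_planet`, the comparison feature `period_bucket`, and the toy rows are all invented, and only `koi_pdisposition` is taken from the slide. It predicts the most common label for each feature value and reports the resulting accuracy.

```python
from collections import Counter, defaultdict

def single_feature_accuracy(rows, feature, label):
    """Accuracy of predicting `label` from `feature` alone,
    using a majority vote within each distinct feature value."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[feature]][row[label]] += 1
    # For each feature value, the majority label is "correct" for that many rows.
    correct = sum(c.most_common(1)[0][1] for c in counts.values())
    return correct / len(rows)

# Toy rows invented for illustration (not real Kepler records).
# `is_planet` and `period_bucket` are hypothetical column names.
rows = [
    {"koi_pdisposition": "CANDIDATE", "period_bucket": "short", "is_planet": True},
    {"koi_pdisposition": "CANDIDATE", "period_bucket": "long", "is_planet": True},
    {"koi_pdisposition": "CANDIDATE", "period_bucket": "short", "is_planet": True},
    {"koi_pdisposition": "CANDIDATE", "period_bucket": "long", "is_planet": True},
    {"koi_pdisposition": "FALSE POSITIVE", "period_bucket": "short", "is_planet": False},
    {"koi_pdisposition": "FALSE POSITIVE", "period_bucket": "long", "is_planet": False},
    {"koi_pdisposition": "FALSE POSITIVE", "period_bucket": "short", "is_planet": False},
    {"koi_pdisposition": "FALSE POSITIVE", "period_bucket": "long", "is_planet": False},
]

leak_score = single_feature_accuracy(rows, "koi_pdisposition", "is_planet")
plain_score = single_feature_accuracy(rows, "period_bucket", "is_planet")
print(leak_score, plain_score)
```

A score near 1.0 for a single column is a reason to look at how that column was produced, not proof of leakage. Columns with a distinct value on every row also score 1.0 under this check, so identifiers should be screened out separately.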

Features with No or Low Predictive Value
• Features with zero or very low variance
  • Examples: all NULLs, all zeroes, etc.
• Features with very high variance (a distinct value on nearly every row)
  • Examples: row IDs, surrogate keys, unique identifiers, free-form text
• Kepler examples of no or low predictive value
  • All zeroes: koi_longp, koi_ingress, koi_model_dof, koi_model_chisq, …
  • Unique identifiers: rowid, kepid
  • Free-form text: koi_comment, koi_limbdark_mod, koi_parm_prov, koi_trans_mod, …
  • Zero variance: koi_vet_stat, koi_vet_date, koi_disp_prov, koi_ldm_coeff3, …
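The two cases above (a single repeated value, or a distinct value on every row) can be flagged with a simple column profile. This is a minimal sketch under stated assumptions: the function name `profile_columns`, the `measurement` column, and every value in the toy rows are invented for illustration, and only the other column names come from the slide. Float columns are deliberately not flagged as identifiers, since a continuous measurement is often unique per row and still useful.

```python
def profile_columns(rows):
    """Label each column as 'constant', 'unique per row', or 'keep for now'."""
    n = len(rows)
    report = {}
    for col in rows[0]:
        values = [row[col] for row in rows]
        distinct = len(set(values))
        if distinct <= 1:
            # All NULLs, all zeroes, or any other single repeated value.
            report[col] = "constant"
        elif distinct == n and not any(isinstance(v, float) for v in values):
            # Non-float values that never repeat: likely an ID or free-form text.
            report[col] = "unique per row"
        else:
            report[col] = "keep for now"
    return report

# Toy rows invented for illustration (not real Kepler records).
# `measurement` is a hypothetical numeric feature.
rows = [
    {"rowid": 1, "kepid": 1001, "koi_longp": None, "koi_vet_stat": "Done", "measurement": 9.5},
    {"rowid": 2, "kepid": 1002, "koi_longp": None, "koi_vet_stat": "Done", "measurement": 54.4},
    {"rowid": 3, "kepid": 1003, "koi_longp": None, "koi_vet_stat": "Done", "measurement": 19.9},
    {"rowid": 4, "kepid": 1004, "koi_longp": None, "koi_vet_stat": "Done", "measurement": 1.7},
]

report = profile_columns(rows)
for col, verdict in report.items():
    print(col, "->", verdict)
```

The flags are candidates for review rather than automatic removals: a "unique per row" column on a very small sample, for instance, may simply not have had the chance to repeat.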

Summary
• Leakage introduces features that can predict the outcome of an observation on their own.
• Leakage features hide the truly predictive variables from a machine learning algorithm.
  • Produces a model that cannot predict reliably on new data
• Features with no predictive value create a muddled picture for a machine learning algorithm to interpret.
  • Increase the computational resources required
  • Can lessen the calculated importance of the truly predictive variables in certain algorithms