Cross-Validation vs. Bootstrap Estimates of Prediction Error in Statistical Modeling
Kaniz Rashid Lubana Mamun, MS Student, CSU Hayward
Dr. Eric A. Suess, Assistant Professor of Statistics, CSU Hayward
Regression Analysis
• To find the regression line for data (xi, yi), minimize the sum of squared residuals, Q(b0, b1) = Σi (yi − b0 − b1 xi)².
• Estimates linear relationships between dependent and independent variables.
• Applications: prediction and forecasting.
Classical Regression Procedure
• Choose a model: y = b0 + b1x1 + b2x2 + e.
• Verify assumptions: normality of the data.
• Fit the model, checking for significance of parameters.
• Check the model's predictive capability.
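The model-fitting step above can be sketched with ordinary least squares. The data below are synthetic stand-ins (an assumption for illustration, not the paper's data), and the coefficient values are arbitrary.

```python
import numpy as np

# Hypothetical data: response y and two predictors x1, x2 (synthetic, not the heart data).
rng = np.random.default_rng(0)
n = 30
x1 = rng.uniform(20, 80, n)
x2 = rng.uniform(30, 60, n)
y = 25.0 + 0.3 * x1 + rng.normal(0, 2, n)

# Fit y = b0 + b1*x1 + b2*x2 + e by ordinary least squares.
X = np.column_stack([np.ones(n), x1, x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)

yhat = X @ b          # fitted values
resid = y - yhat      # residuals, used to check assumptions and significance
print(b)
```

With an intercept in the model, the residuals average to zero by construction; checking parameter significance and normality of `resid` would be the next steps in the procedure.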
Mean Squared Error of Prediction
• MSEP measures how well a model predicts the response value of a future observation.
• For our regression model, the MSEP of a new observation y_{n+1} is MSEP = E[(y_{n+1} − ŷ_{n+1})²].
• Small values of MSEP indicate good predictive capability.
What is Cross-Validation?
• Divide the data into two sub-samples:
– a treatment set (to fit the model),
– a validation set (to assess predictive value).
• Non-parametric approach: mainly used when the normality assumption is not met.
• Criterion for the model's prediction ability: usually the MSEP statistic.
CV for Linear Regression: The "Withhold-1" Algorithm
• Use the model: y = b0 + b1x1 + b2x2 + e.
• Withhold one observation (x1i, x2i, yi).
• Fit the regression model to the remaining n − 1 observations.
• For each i, calculate the prediction error di = yi − ŷ(i), where ŷ(i) is the prediction for the withheld observation from that fit.
• Finally, calculate MSEP_CV = (1/n) Σi di².
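The withhold-1 steps above can be sketched as follows. The data are synthetic (an assumption, not the heart data), but the loop follows the algorithm: drop one observation, refit, predict the held-out point, and average the squared errors.

```python
import numpy as np

# Synthetic data for the model y = b0 + b1*x1 + b2*x2 + e
# (x2 has true coefficient 0, mirroring a predictor that adds nothing).
rng = np.random.default_rng(1)
n = 25
x1 = rng.uniform(20, 80, n)
x2 = rng.uniform(30, 60, n)
y = 25.0 + 0.3 * x1 + rng.normal(0, 2, n)

X = np.column_stack([np.ones(n), x1, x2])

d = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i                  # withhold observation i
    b, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    d[i] = y[i] - X[i] @ b                    # prediction error for the held-out point

msep_cv = np.mean(d ** 2)                     # withhold-1 estimate of MSEP
print(msep_cv)
```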
What is the Bootstrap?
• The Bootstrap is:
– a computationally intensive technique,
– based on simulation and resampling.
• Used here to assess the accuracy of statistical estimates for a model:
– confidence intervals,
– standard errors,
– an estimate of MSEP.
Algorithm for a Bootstrap
• From a data set of size n, randomly draw B samples with replacement, each of size n.
• Find the estimate of MSEP for each of the B samples: θ̂*_b = MSEP estimated from the b-th bootstrap sample, b = 1, …, B.
• Average these B estimates of θ to obtain the overall bootstrap estimate: θ̂_boot = (1/B) Σb θ̂*_b.
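A minimal sketch of the algorithm above, assuming synthetic data and one simple variant of the MSEP estimate (each bootstrap fit is scored on the original observations); the slides do not spell out this scoring choice, so it is an assumption here.

```python
import numpy as np

# Synthetic data (not the heart data); B bootstrap samples of size n.
rng = np.random.default_rng(2)
n, B = 25, 200
x1 = rng.uniform(20, 80, n)
x2 = rng.uniform(30, 60, n)
y = 25.0 + 0.3 * x1 + rng.normal(0, 2, n)
X = np.column_stack([np.ones(n), x1, x2])

theta = np.empty(B)
for b_idx in range(B):
    idx = rng.integers(0, n, n)          # draw a sample of size n with replacement
    b, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    # MSEP estimate for this bootstrap sample: squared error of its fit
    # evaluated on the original data (assumed scoring rule).
    theta[b_idx] = np.mean((y - X @ b) ** 2)

msep_boot = theta.mean()                 # overall bootstrap estimate
print(msep_boot)
```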
Schematic Diagram of the Bootstrap
[Diagram: population F → (sampling variability) → data X = (x1, x2, …, xn) → (resampling variability) → bootstrap samples X1*, X2*, …, XB* → estimates θ̂(X1*), θ̂(X2*), …, θ̂(XB*).]
Application: Heart Measurements on Children
• Study: catheterize 12 children with heart defects and take measurements.
• Variables measured:
– y: observed catheter length in cm,
– w: patient's weight in pounds,
– h: patient's height in inches.
• Goal: predict y from w and h.
• Difficulties: small n, non-normal data.
Model and Fitted Model
• Model: y = b0 + b1w + b2h + e.
• Fitted model: ŷ = 25.6 + 0.277w.
• Parameter estimates for the heart data:
– b0 estimated as 25.6,
– b1 estimated as 0.277,
– b2 term eliminated from the model (not useful).
Regression Results
• Both parameters b0 and b1 are significantly different from 0 (important to the model):
– p-values: 0.000 (for b0) and 0.000 (for b1),
– R² = 80% (of variation in y explained).
• Once weight is known, height does not provide additional useful information.
• Example: for a child weighing 50 lbs, the estimated catheter length is 39.45 cm.
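The quoted prediction can be checked directly from the reported parameter estimates:

```python
# Check of the prediction quoted in the slides: the fitted model
# yhat = 25.6 + 0.277 * w evaluated at a weight of 50 lbs.
b0, b1 = 25.6, 0.277
w = 50
yhat = b0 + b1 * w
print(yhat)   # 39.45
```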
Comparison of CV and Bootstrap
• MSEP estimates:
– CV: MSEP = 18.05,
– Bootstrap: MSEP = 12.04 (smaller = better).
• For this example: the Bootstrap has the better prediction capability.
• In general:
– CV methods work well for large samples,
– the Bootstrap is effective even for small samples.