THEORY OF WINNING
Coaching, recruiting and spending in college football
2010 Alabama Mr. Football Coty Blanchard
Table of Contents
• Introduction
• How to predict a win
• Data sources
• Initial Model
• Out of sample prediction
• Practical applications
• Next steps
Jason Campbell, Auburn University
Authors Introduction
McDonald “Mac” Mirabile
• Manager of Strategic & Financial Analysis at WWF
• Undergraduate and graduate theses on the predictors of a successful transition from college to the NFL
• Prior academic publications on topics such as biases in college football polls, the NFL Rookie Cap, the Wonderlic Test, and the Peer Effect in the NFL draft
Mark Witte
• Assistant Professor at College of Charleston
Topic Introduction
The importance of winning in college
• Shapes alumni support, attendance
• Influences quality of recruiting
• Self-reinforcing cycle
How to predict a win
• Vegas point spread, totals, and money line theoretically capture all available information under the efficient market hypothesis (EMH)
• Existing literature consistently supports EMH, though there are some published examples of deviations and profitable strategies within wagering markets
• Within the framework of this paper, we assume EMH holds within college football wagering markets and measure the success of our models relative to the baseline Vegas model
Predicting Wins with the Vegas Line Bubble chart illustrates the home team’s winning percent by the Vegas Line, with the size of the bubble based on the number of observations
Predicting Wins with the Vegas Line Bar chart of home team’s winning percentage by the Vegas line
The Vegas Line model
Home Win (0, 1) = b1*Line + error
• Within our data, this model explains 29% of the variation in wins (Pseudo R²)
• The Line coefficient is 0.1091, with a standard error of 0.00437 and an odds ratio of 1.115
• Interpretation: for each additional point a team is favored, its odds of winning increase by 11.5%
• A non-linear model shows similar results
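The odds ratio quoted above is the exponential of the logit coefficient. The sketch below reproduces that arithmetic and converts a line into an implied home-win probability. It assumes the no-intercept form written on the slide and that a positive Line means the home team is favored; the function names are illustrative, not the authors'.

```python
import math

LINE_COEF = 0.1091  # coefficient on Line reported on the slide


def odds_ratio(coef, points=1.0):
    """Multiplicative change in the odds of a home win for `points` of line."""
    return math.exp(coef * points)


def home_win_probability(line, coef=LINE_COEF):
    """Implied probability from a logit with no intercept (assumed form)."""
    return 1.0 / (1.0 + math.exp(-coef * line))


if __name__ == "__main__":
    print("odds ratio per point:", round(odds_ratio(LINE_COEF), 3))
    for line in (0, 3, 7, 14):
        print(line, round(home_win_probability(line), 3))
```

Under these assumptions a pick'em game (Line = 0) maps to a 50% home-win probability, because the model has no intercept term.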
Improving the Vegas Line model
Can it be done, or does the Vegas line incorporate all publicly available information? To test this, we added several variables:
• Home, Away win and losing streaks
• Home, Away AP Rankings, Top 25 matchups
• Dummy variables for conference games, neutral field matchups, and night games
• Distance between schools, stadium size, rivalry information
• Conference dummy variables
Improving the Vegas Line model
Effect       DF    Wald Chi-Square    Pr > ChiSq
Line          1        285.4691         <.0001
ETP           1          1.1213         0.2896
HWS           1          0.522          0.47
HLS           1          0.8024         0.3704
AWS           1          1.8483         0.174
ALS           1          0.1004         0.7513
Hrank         1          0.7195         0.3963
Arank         1          0.1588         0.6903
HNR           1          1.591          0.2072
ANR           1          1.5452         0.2138
TrueT25       1          0.2535         0.6146
ConfGame      1          0.003          0.9566
Neutral       1          0.001          0.9743
Nightgame     1          0.3414         0.559
Stadium       1          1.078          0.2992
Distance      1          0.0145         0.9042
Rivalry       2          2.0766         0.3541
Conf         12         11.5154         0.4853
• The table shows these additional variables and their corresponding Wald Chi-Square statistics
• The Vegas line successfully incorporates all available information
• Adding more explanatory variables does not improve the model’s fit
• None of the added variables is statistically significant, as their importance is already captured in the Line variable
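The reported p-values can be checked directly from the Wald statistics. The sketch below uses closed forms for the chi-square upper tail with 1 and 2 degrees of freedom, so it needs only the standard library. It assumes 1 degree of freedom for the single-parameter effects such as ETP and HLS and 2 for Rivalry; the helper name is hypothetical.

```python
import math


def wald_p_value(chi_sq, df):
    """Upper-tail chi-square probability for df = 1 or df = 2."""
    if df == 1:
        # A chi-square(1) variable is a squared standard normal.
        return math.erfc(math.sqrt(chi_sq / 2.0))
    if df == 2:
        # A chi-square(2) variable is exponential with mean 2.
        return math.exp(-chi_sq / 2.0)
    raise ValueError("this sketch only handles df = 1 or df = 2")


if __name__ == "__main__":
    # (effect, Wald statistic from the slide, assumed df)
    for effect, wald, df in [("ETP", 1.1213, 1), ("HLS", 0.8024, 1), ("Rivalry", 2.0766, 2)]:
        print(effect, round(wald_p_value(wald, df), 4))
```

Every added effect has a p-value well above conventional significance thresholds, which is the basis for the conclusion that the line already captures their information.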
Data Sources
To develop a model of winning without utilizing the Vegas line, the authors gathered data on the following topics:
• Game-specific factors
• Institutional factors/history
• Team player composition/recruiting
• Team coach factors/history
We will discuss the collection and organization of this data next
Game-specific Factors
• Matchup data comes from Covers.com
• Data includes game location, time, day, conference information
• Each matchup (home vs away) is one observation in the dataset
• There are about 500 games per season
Institutional Factors & History
• Historical team performance comes from CFBDatawarehouse.com
• University football team expenditure and student body size data come from the Equity in Athletics website
• Each of these variables is reported for a particular year (e.g., Michigan’s historical team performance through 2007 and their team expenditure data for the 2008 season would all be used as predictors for the 2008 season matchups)
Team player composition and recruiting
• Class recruiting data comes from Rivals.com, Scouts.com, and Prepstar.com
• Recruiting classes in 2005 (RS-Senior), 2006 (Senior / RS-Junior), 2007 (Junior / RS-Sophomore), 2008 (Sophomore / RS-Freshman), and 2009 (Freshman) are used as predictors for the 2009 season matchups
• Due to the NFL draft, transfers, and general attrition, these variables are imperfect measures of the talent comprising a team in a particular season
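The class-to-season mapping described above generalizes to any season: the five most recent recruiting classes feed the roster. A minimal sketch, with a hypothetical function name and the eligibility labels taken from the slide:

```python
# Eligibility labels by years since signing (0 = current class), per the slide.
ELIGIBILITY = [
    "Freshman",
    "Sophomore / RS-Freshman",
    "Junior / RS-Sophomore",
    "Senior / RS-Junior",
    "RS-Senior",
]


def contributing_classes(season):
    """Map each recruiting-class year to its eligibility label in `season`."""
    return {season - age: label for age, label in enumerate(ELIGIBILITY)}


if __name__ == "__main__":
    for year, label in sorted(contributing_classes(2009).items()):
        print(year, label)
```

This treats every signee as still on the roster, which is exactly the imperfection the slide notes: early NFL departures, transfers, and attrition are not modeled.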
Team coach factors and history
• Historical coach performance comes from CFBDatawarehouse.com
• Coach biographical information comes from various university athletics department websites
• Each of these variables is reported for a particular year (e.g., Michigan’s coach’s historical performance through 2007 would be used as a predictor for the 2008 season matchups)
Summary Statistics of Model variables
Initial Model
N: 2,948   R-Square: .215
Matchup-specific variables:
• Stadium Size
• Home team student size
School-specific variables:
• Cumulative Team Win Pct Diff
• Log Diff of Total Team expenditures
Team-specific variables (Difference home – away):
• Scouts.com weighted average class ranking
Coach-specific variables (Difference home – away):
• First year head coach Home team dummy
• First year head coach Away team dummy
• Coach age
• Coach experience (assistant + HC)
• Head coach seasons
• Lifetime Coach Win Pct Diff
• Years as NFL player
• Home team’s head coach minority dummy
• Away team’s head coach minority dummy
Initial Model - Interpretations
Matchup-specific variables:
• Stadium Size – for every additional 10,000 seats, the home team is 4% more likely to win
• (also considered game time, location, rivalry variables)
School-specific variables:
• Log Diff of Total Team expenditures – the odds ratio of 2.5 on the % difference (home/away) in team spending suggests that a team spending 100% more (twice as much) is 150% more likely to win (alternative, equivalent interpretation: odds of winning increase 15% for each 10% increase in excess of your opponent’s expenditures)
Team-specific variables (all Difference home – away):
• Scouts.com average class ranking – for each unit increase in the difference in average class ranking between the home and away teams, the home team is 1% more likely to win
Coach-specific variables (all Difference home – away):
• First year head coach dummy variables – marginally significant, with coefficients in the direction one would expect
• Diff in HC’s ages – for each additional year in age difference between the home and away teams’ coaches, the home team is 1% less likely to win
• Diff in HC’s cumulative Win % – for each 1% difference in lifetime win percentage between the home team’s HC and the away team’s HC, the home team is about 6% more likely to win
• Years as NFL player – for each additional year of NFL playing experience between the home team’s HC and the away team’s HC, the home team is about 4% less likely to win
• Home team Head Coach Minority – minority coaches are 42% less likely to win than non-minority coaches at home
• Away team Head Coach Minority – home teams are 87% more likely to win when playing against a minority coach
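Because these figures come from a logit model, "X% more likely" is read here as a change in odds rather than in probability; that reading is an assumption. The sketch below converts an odds ratio into a percent change in odds and shows what it implies for the win probability at a hypothetical baseline.

```python
def percent_change_in_odds(odds_ratio):
    """Percent change in odds implied by an odds ratio (2.5 -> +150%)."""
    return (odds_ratio - 1.0) * 100.0


def shifted_probability(baseline_prob, odds_ratio):
    """Win probability after multiplying the baseline odds by `odds_ratio`."""
    odds = baseline_prob / (1.0 - baseline_prob) * odds_ratio
    return odds / (1.0 + odds)


if __name__ == "__main__":
    baseline = 0.5  # hypothetical even matchup, not a figure from the slides
    for label, ratio in [("spending (OR 2.5)", 2.5), ("away HC minority (OR 1.87)", 1.87)]:
        print(label, percent_change_in_odds(ratio), round(shifted_probability(baseline, ratio), 3))
```

From an even matchup, an odds ratio of 2.5 moves the win probability to roughly 71%, not to 125%, which is why the odds-versus-probability distinction matters when reading the bullets above.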
Out of Sample prediction
Analysis Variable: Vegas_Model_Correct
Sample    Correct    Incorrect    % Correct
In          1,450        541        72.8%
Out           612        233        72.4%
Analysis Variable: Our_Model_Correct
Sample    Correct    Incorrect    % Correct
In          1,420        648        68.7%
Out           603        278        68.4%
Both models have comparable in-sample and out-of-sample performance
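The percentages follow directly from the correct and incorrect counts. A short sketch that recomputes them from the counts on this slide (the dictionary layout and function name are illustrative):

```python
def percent_correct(correct, incorrect):
    """Share of games predicted correctly, in percent."""
    return 100.0 * correct / (correct + incorrect)


# (model, sample) -> (correct, incorrect), copied from the slide
RESULTS = {
    ("Vegas", "In"): (1450, 541),
    ("Vegas", "Out"): (612, 233),
    ("Ours", "In"): (1420, 648),
    ("Ours", "Out"): (603, 278),
}

if __name__ == "__main__":
    for (model, sample), (right, wrong) in RESULTS.items():
        print(model, sample, round(percent_correct(right, wrong), 1))
```

The small gap between each model's in-sample and out-of-sample accuracy suggests neither is badly overfit, while the Vegas model stays about four points ahead in both samples.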
Out of Sample by Line Vegas line does a better job predicting everything except games where the line is between -2 and +2
2009 Season (SEC results)
• Data from 2004-2008 used to develop the model
• Data from 2009 used in an out-of-sample validation
• Note: Non-Division I-A opponents not scored/modeled
Practical Applications Predict 2010 season results – conference standings, national champion, before a single game has been played
Next steps
What can be added to the model?
• New sources of data (attendance, compensation/bonus; impute missing values based on relative rank of team within conference?)
• Additional data cleanup (game time, more years 2001-2003)
• Different estimation methodologies
BACKUP/OLD SLIDES BEGIN HERE
Who is hiring minority coaches?
The coach is more likely to be young (see coach_age) and to belong to a historically weak program (Cum_WinPCT_School_H) that has also been weak recently (MA5_Win_PCT_School_H), at relatively newer schools (School_Seasons_H) and larger schools (Stadium).
Predicting recruiting classes
GLM estimation of dependent variable: Scouts class ranking
• Previous year and 5-year MA Win % impact recruiting
• Previous classes are also good predictors of current year’s class ranking
• Conference impacts recruiting
Alabama (2010) = 43.4 - (9.7*1) - (15.5*.77) + (.27*2) + (.18*1) + (.1*22) + (.13*18) - 21.8 ≈ 5 (Actual rank 4)
Auburn (2010) = 43.4 - (9.7*.615) - (15.5*.66) + (.27*16) + (.18*18) + (.1*6) + (.13*9) - 21.8 ≈ 15 (Actual rank 5)
Vanderbilt (2010) = 43.4 - (9.7*.167) - (15.5*.38) + (.27*72) + (.18*74) + (.1*87) + (.13*61) - 21.8 ≈ 63 (Actual rank 61)
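The three worked examples share one linear predictor, which the sketch below evaluates. It uses the rounded coefficients shown on the slide, so results can differ slightly from figures computed with the unrounded estimates. The term labels in the comments (previous-year win %, 5-year MA win %, four earlier class ranks, conference adjustment) are an assumed reading of the slide, not confirmed by the source.

```python
INTERCEPT = 43.4
CONFERENCE_ADJUSTMENT = -21.8  # assumed to be the conference effect for these schools

# Assumed order: previous-year win %, 5-year MA win %, then four earlier class ranks.
COEFFICIENTS = (-9.7, -15.5, 0.27, 0.18, 0.10, 0.13)

SCHOOL_INPUTS = {
    "Alabama": (1.0, 0.77, 2, 1, 22, 18),
    "Auburn": (0.615, 0.66, 16, 18, 6, 9),
    "Vanderbilt": (0.167, 0.38, 72, 74, 87, 61),
}


def predicted_class_rank(inputs):
    """Evaluate the slide's linear predictor for one school."""
    weighted = sum(c * x for c, x in zip(COEFFICIENTS, inputs))
    return INTERCEPT + weighted + CONFERENCE_ADJUSTMENT


if __name__ == "__main__":
    for school, inputs in SCHOOL_INPUTS.items():
        print(school, round(predicted_class_rank(inputs), 1))
```

Lower values mean a better class, so strong recent win percentages (negative coefficients) and strong earlier classes (small rank numbers) both pull the prediction toward the top of the rankings.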
2009 out of sample (A-F)
2009 out of sample (G-M)
2009 out of sample (M-S)
2009 out of sample (S-U)
2009 out of sample (V-W)
Other considerations (backup slide)
• Off the field model: .18
• On the field model: .26
• Are the coefficients robust?
• Future problems: things that recruits like – new stadiums, new weight rooms, facilities
• Could we do a recruiting paper modeled on NCAA football recruiting info – coach history, academic prestige, location, TV time, etc.?
Out of Sample prediction (intercept)
Analysis Variable: Vegas_Model_Correct
Sample    Correct    Incorrect    % Correct
In            837        315        72.7%
Out           384        132        74.4%
Analysis Variable: Our_Model_Correct
Sample    Correct    Incorrect    % Correct
In            787        365        68.3%
Out           352        164        68.2%
Both models have comparable in-sample and out-of-sample performance
Friday Meet with profs about research Present to a class Lunch Seminar presentation Dinner
Models
To begin, we will look at each of these data sources and its relationship to our outcome variable individually. Because each data source is described by dozens of potential variables, this initial modeling will inform our final set of models, in which data from all sources are considered together. All models are estimated as logit models because our outcome variable, Home Win, is binary. We will discuss the resulting coefficients as odds ratios.
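To illustrate how a logit coefficient is estimated and then reported as an odds ratio, the sketch below fits a single-predictor, no-intercept logit by Newton's method. The data are synthetic (random lines and outcomes generated with a hypothetical true coefficient of 0.11), not the authors' dataset, and the fitting routine is a minimal stand-in for whatever statistical package was actually used.

```python
import math
import random


def fit_logit_no_intercept(x, y, iterations=25):
    """Maximum-likelihood slope for P(y=1) = 1 / (1 + exp(-b*x)) via Newton steps."""
    b = 0.0
    for _ in range(iterations):
        gradient = 0.0
        information = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-b * xi))
            gradient += xi * (yi - p)
            information += xi * xi * p * (1.0 - p)
        if information == 0.0:
            break
        b += gradient / information
    return b


random.seed(0)
TRUE_COEF = 0.11  # hypothetical value used only to generate synthetic games
lines = [random.gauss(0.0, 10.0) for _ in range(5000)]
wins = [1 if random.random() < 1.0 / (1.0 + math.exp(-TRUE_COEF * v)) else 0 for v in lines]

b_hat = fit_logit_no_intercept(lines, wins)

if __name__ == "__main__":
    print("estimated coefficient:", round(b_hat, 4))
    print("odds ratio:", round(math.exp(b_hat), 4))
```

Exponentiating the fitted coefficient gives the odds ratio, which is the quantity tabulated in the model slides that follow.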
Model 1: Game specific factors
Odds Ratio Estimates
Effect                             Point Estimate    90% Wald Confidence Limits
Neutral                                 0.804             0.62     1.05
Nightgame                               0.868             0.76     0.99
Stadium                                 1.018             1.01     1.02
Conference ACC vs WAC                   0.573             0.4      0.83
Conference Big East vs WAC              0.741             0.49     1.12
Conference Big Ten vs WAC               0.444             0.3      0.66
Conference Big Twelve vs WAC            0.665             0.46     0.96
Conference CUSA vs WAC                  0.576             0.4      0.84
Conference INDP vs WAC                  0.554             0.36     0.86
Conference MAC vs WAC                   0.789             0.55     1.13
Conference Mountain West vs WAC         0.766             0.52     1.12
Conference NC vs WAC                    0.992             0.74     1.34
Conference Pac Ten vs WAC               0.483             0.33     0.7
Conference SEC vs WAC                   0.347             0.24     0.51
Conference Sun Belt vs WAC              0.955             0.64     1.43
day_of_week Fri vs Wed                  1.184             0.65     2.16
day_of_week Mon vs Wed                  0.662             0.31     1.43
day_of_week Sat vs Wed                  1.407             0.81     2.44
day_of_week Sun vs Wed                  1.103             0.55     2.21
day_of_week Thu vs Wed                  1.686             0.92     3.09
day_of_week Tue vs Wed                  1.727             0.85     3.52
Distance                                1                 1        1
Model 1: Game specific factors Other considered variables Distance b/w schools Rivalry game (major/minor/none) Other variables to consider in the future: Game-time (need to clean some data)
Model 2: Institutional factors & history
Odds Ratio Estimates
Effects: Cum_Losses_School_H, MA5_Wins_School_H, TOTAL_EXPENSE_ALL_Fo, EFMaleCount_H, EFFemaleCount_H, Cum_Losses_School_A, MA5_Win_PCT_School_A, TOTAL_EXPENSE_ALL_Fo, school_seasons_ldf, cum_winpct_adf, total_expense_all_fo, school_seasons_31t75, school_seasons_m101_
Point Estimates: 1.005, 1.039, 0.972, 1, 1, 0.996, 0.082, 1.028, 0.53, 185.1, 3.366, 1.912, 0.711, 0.715
95% Wald Confidence Limits: 1.003 1.007; 1.03 1.047; 0.932 1.014; 1 1; 0.994 0.998; 0.048 0.14; 0.986 1.072; 0.363 0.775; 20.63 >999.999; 2.149 5.273; 1.269 2.882; 0.491 1.031; 0.586 0.873
Model 2: Institutional factors & history Other considered variables Other variables to consider in the future:
Model 3: Recruiting