Which Betas are Smart Campbell R Harvey Duke

Which Betas are Smart? Campbell R. Harvey Duke University, NBER and Man Group plc @camharvey Version: October 29, 2016 1

Terminology 2

Terminology I thought this beta was smart but that was a mistake: False Positive 3

Terminology 4

Terminology I didn’t invest in this beta but that was a mistake: False Negative Type II linked to Type I 5

Three forces contributing to Type I errors • Evolutionary propensity to tolerate Type I error • Randomness – with enough tests, something will look “significant” • Rare effects – we incorrectly ignore prior beliefs leading to a high error rate 6

Evolutionary Foundations • We have a very high tolerance for Type I error • There is a tradeoff of Type I and Type II errors • For example, if we declared all patients pregnant there would be a 0% Type II error, but a very large Type I error 7

Randomness: Noise routinely mistaken for signal 8

Rare Effects: 500 Shades of Gray Experiment conducted at University of Virginia • Hypothesis: Political extremists see only black and white – literally. • Experiment: Show words in different shades of gray and then ask participants to try to match color on gradient. • Afterwards, evaluate where their political beliefs place on the spectrum and test hypothesis that moderates are more accurate. Nosek, Spies and Motyl (2012) 9

Rare Effects: 500 Shades of Gray Hello Drag slide to match the colour of the word 10

Rare Effects: 500 Shades of Gray Group 2: Extremists Group 1: Moderates 11

Rare Effects: 500 Shades of Gray Dramatic results with large sample of 2,000 participants • Moderates were able to see significantly more shades of gray • P-value < 0.001, which is highly significant: if the null hypothesis of no effect were true, results this extreme would arise less than 0.1% of the time 12

Rare Effects: 500 Shades of Gray Researchers decided to replicate before submitting results for publication in a top journal • Replication saw no significant difference • P-value was 0.59 (not even close to significant) 13

Rare Effects: 500 Shades of Gray Lesson: If the hypothesis is unlikely, then we need to be especially careful. There will be a lot of false positives using standard testing procedures. Ideally, we incorporate information in the testing procedure when we know the effect is rare. 14

Rare Effects: The Power Pose Hypothesis: Standing in a posture of confidence impacts testosterone and cortisol levels in the brain, leading to increased risk taking. Evidence: Carney, Cuddy and Yap, 2010. Power Posing: Brief Nonverbal Displays Affect Neuroendocrine Levels and Risk Tolerance, Psychological Science 21(10), 1363-1368.

Rare Effects: The Power Pose Second most viewed TED talk in history

Rare Effects: The Power Pose New York Times Best Seller (reached #3)

Rare Effects: The Power Pose

Rare Effects: The Power Pose Simmons and Simonsohn, 2016, https://ssrn.com/abstract=2791272

Rare Effects: The Power Pose Simmons and Simonsohn, 2016, https://ssrn.com/abstract=2791272 Also see Gelman and Fung, 2016. http://www.slate.com/articles/health_and_science/2016/01/amy_cuddy_s_power_pose_research_is_the_latest_example_of_scientific_overreach.html

Rare Effects: The Power Pose 24 studies Simmons and Simonsohn, 2016, https://ssrn.com/abstract=2791272

Rare Effects: The Power Pose Dana Carney retracts http://faculty.haas.berkeley.edu/dana_carney/pdf_My%20position%20power%20poses.pdf

Medicine Fact: 1% of women aged 40-50 have breast cancer • 90% of breast cancers correctly identified with mammogram • 10% is the rate of false diagnosis Question: Suppose the test comes back positive. What is the probability you have breast cancer? 23

Polling Question: Suppose the test comes back positive. What is the probability you have breast cancer? • 90% • 78% • 38% • 8% 24

Rare Effects: Medicine • Sample size=1,000, 10 true cases • Test 90% accurate, 9/10 of women with cancer will test significant. What about the remaining 990 tests? 25

Rare Effects: Medicine • Sample size=1,000, 10 true cases • Test 90% accurate, 9/10 of women with cancer will test significant. What about the remaining 990 tests? • 99/990 will be false significant Given a significant test, what is the probability of cancer? 9/(9+99) = 8% 26
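The 9/(9+99) count above is Bayes' rule applied to the slide's three inputs. A minimal sketch follows; the function and parameter names are illustrative and not from the original deck.

```python
def posterior_given_positive(prevalence, sensitivity, false_positive_rate):
    """P(condition | positive test) via Bayes' rule."""
    true_pos = prevalence * sensitivity                    # sick and flagged
    false_pos = (1.0 - prevalence) * false_positive_rate   # healthy but flagged
    return true_pos / (true_pos + false_pos)

# Inputs from the slide: 1% prevalence, 90% detection, 10% false diagnosis.
p = posterior_given_positive(prevalence=0.01, sensitivity=0.90,
                             false_positive_rate=0.10)
print(f"P(cancer | positive test) = {p:.1%}")
```

Raising the prevalence argument shows how quickly the answer changes once the effect is no longer rare, which is the point of the example.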

What about Finance? Performance of trading strategy is very impressive. • SR=1 • Consistent • Drawdowns acceptable Source: AHL Research 27

What about Finance? Source: AHL Research 28

What about Finance? Sharpe = 1 Sharpe = 2/3 Sharpe = 1/3 200 random time-series mean=0; volatility=15% Source: AHL Research 29
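The experiment on this slide can be approximated in a few lines. This is a sketch under assumptions that the slide does not state: 10 years of monthly returns, normally distributed, independent across series, with an arbitrary random seed.

```python
import numpy as np

rng = np.random.default_rng(0)            # arbitrary seed (assumption)
n_series, n_years, periods = 200, 10, 12  # 10 years of monthly data (assumption)
vol_annual = 0.15                         # from the slide

# Every series has a true mean of zero, so every true Sharpe ratio is zero.
returns = rng.normal(0.0, vol_annual / np.sqrt(periods),
                     size=(n_years * periods, n_series))

# Annualized in-sample Sharpe ratio of each random series.
sharpe = returns.mean(axis=0) / returns.std(axis=0, ddof=1) * np.sqrt(periods)

print(f"best Sharpe:  {sharpe.max():.2f}")
print(f"worst Sharpe: {sharpe.min():.2f}")
print(f"share with |Sharpe| > 1/3: {(np.abs(sharpe) > 1 / 3).mean():.0%}")
```

Picking the best of the 200 series after the fact is what produces an impressive-looking track record from pure noise.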

Other Sciences? Particle Physics § Higgs boson proposed in 1964 (same year as Sharpe published the CAPM) § First tests of the CAPM in 1972 and Nobel award in 1990. § Longer road for Higgs: § $5 billion to construct LHC. § “Discovered” in 2012. § Nobel 2013. 30

Other Sciences? Particle Physics § Testing method very important § Particle rare and decays quickly and the key is measuring the decay signature § Frequency is 1 in 10 billion collisions and over a quadrillion collisions were conducted § Problem is that the decay signature could also be caused by normal events from known processes 31

Other Sciences? Particle Physics § The two groups involved in testing (CMS and ATLAS) decided on what appears to be a tough standard: t-ratio must exceed 5 32

Terminology P-value (probability value, low value is good) § In a test, we want a low chance of a Type I error (false positive) and usually set the significance level at 5%. (Often referred to as 95% confidence.) § This is the 2-sigma rule. If the effect is two standard deviations or more from zero, then there is roughly only a 5% chance of a Type I error. § Ideally, we look for more than 2-sigma (smaller p-values) 33
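To connect the 2-sigma rule here with the 5-sigma standard used for the Higgs, the following sketch converts a sigma level into a two-sided p-value. It assumes the test statistic is standard normal under the null; the function name is illustrative.

```python
import math

def two_sided_p_value(z):
    """Two-sided p-value for a z-score under a standard normal null."""
    return math.erfc(abs(z) / math.sqrt(2.0))

for z in (1.96, 2.0, 3.0, 5.0):
    print(f"{z:.2f} sigma -> p = {two_sided_p_value(z):.2e}")
```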

Examples in Financial Economics Two sigma rule only appropriate for a single test § As we do more tests, there is a chance we find something “significant” (by the two sigma rule) but it is a fluke. § Here is a simple way to see the impact of multiple tests: Alphabet 34

Examples in Financial Economics 3.4 sigma strategy • Profitable during financial crisis • Zero beta vs. market, value, size, and momentum • Impressive performance recently 35

Examples in Financial Economics Details • Long tickers “S” • Short tickers “U” 36

Examples in Financial Economics Research • One study looks at companies with meaningful ticker symbols, like Southwest’s LUV, and shows they outperform. [1] • Another study argues that tickers that are easy to pronounce, like BAL vs. BDL, outperform in IPOs. [2] • Yet another study suggests that tickers that are congruent with the company’s name outperform. [3] [1] Head, Smith and Watson, 2009; [2] Alter and Oppenheimer, 2006; [3] Srinivasan and Umashankar, 2014 37

Examples in Financial Economics Product? 38

Examples in Financial Economics Product? “Using these AlphaBet portfolios, one may be able to construct other betas. For example, using the portfolios S, M, A, R, and T, an investor can build a SMART-beta using the AlphaBet ETFs. Not surprisingly, this SMART-beta portfolio handily outperformed the S&P 500: 19.6% to 9.4% annualized over the period. And what about the ALPHA portfolio? Yep, it also outperformed the S&P 500 Index, with an annualized return of 18.7%. Even “BETA” outperforms.” 39

Examples in Financial Economics Product? Using these AlphaBet portfolios, one may be able to construct other betas. For example, using the portfolios S, M, A, R, and T, an investor can build a SMART-beta using the AlphaBet ETFs. Not surprisingly, this SMART-beta portfolio handily outperformed the S&P 500: 19.6% to 9.4% annualized over the period. And what about the ALPHA portfolio? Yep, it also outperformed the S&P 500 Index, with an annualized return of 18.7%. Even “BETA” outperforms. Yes, this is a spoof! 40

Examples in Financial Economics 5 factors 41

Examples in Financial Economics 15 factors 42

Examples in Financial Economics 82 factors Source: The Barra US Equity Model (USE 4), MSCI (2014) 43

Examples in Financial Economics 400 factors! Source: https://www.capitaliq.com/home/who-we-help/investment-management/quantitative-investors.aspx 44

Examples in Financial Economics 18,000 signals examined in Yan and Zheng (2015) 45

A framework to separate luck from skill Three research initiatives: 1. Explicitly adjust for multiple tests (“Backtesting”) 2. Bootstrap (“Lucky Factors”) 3. Noise reduction (“Rethinking Performance Evaluation”) 46

1. Multiple Tests § Provide a new framework to do multiple tests in the presence of correlations among tests and publication bias (hidden tests) § Provide guidelines for future research 47

1. Multiple Tests: Number of Factors and Publications [Figure: number of papers and number of factors per year, with the cumulative number of factors] 48

1. Multiple Tests: How Many Discoveries Are False? § In multiple testing, how many tests are likely to be false? § In single testing (significance level = 5%), 5% is the “error rate” (false discoveries) § In multiple testing, the false discovery rate (FDR) is usually much larger than 5% 49
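A related quantity that is easy to compute is the chance of at least one false discovery across many tests. This is the family-wise error rate rather than the FDR named on the slide, and the sketch assumes the tests are independent and that every null hypothesis is true; both are simplifying assumptions, not claims from the deck.

```python
def prob_at_least_one_false_positive(n_tests, alpha=0.05):
    """Chance that at least one of n independent true-null tests looks significant."""
    return 1.0 - (1.0 - alpha) ** n_tests

for n in (1, 5, 26, 100, 316):
    print(f"{n:>3} tests -> {prob_at_least_one_false_positive(n):.1%}")
```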

1. Multiple Tests: Bonferroni's Method § Here is a simple adjustment called the Bonferroni adjustment § For a single test, you are tolerant of 5% false discoveries § Hence, a p-value of 5% or less means you declare a finding “true” § Bonferroni simply multiplies the p-value by the number of tests 50

1. Multiple Tests: Bonferroni's Method § Bonferroni simply multiplies the p-value by the number of tests § In a single test, if you get a p-value of 0.05 you declare “significant” § Suppose the stock portfolio beginning with the letter “S” outperforms with a p-value of 0.02, which appears “significant” § Bonferroni adjustment: 26 x 0.02 = 0.52, which is “not significant” – not even close! More than a 50% chance it is a fluke. 51
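The adjustment in the ticker example is a one-line calculation. In this sketch the cap at 1 is a standard convention added here, not something stated on the slide.

```python
def bonferroni_adjust(p_value, n_tests):
    """Bonferroni-adjusted p-value: multiply by the number of tests, cap at 1."""
    return min(1.0, p_value * n_tests)

# The "S" portfolio: p-value 0.02, one test per letter of the alphabet.
adjusted = bonferroni_adjust(0.02, 26)
print(f"adjusted p-value = {adjusted:.2f}")
print("significant at 5%" if adjusted <= 0.05 else "not significant at 5%")
```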

1. Multiple Tests: Rewriting History [Figure: t-ratios of selected factors (MRT, EP, SMB, HML, LRV, DEF, MOM, SRV, LIQ, CVOL, IVOL, DCG) by year, 1965-2025, against the Bonferroni, Holm and BHY thresholds and the t-ratio = 1.96 (5%) line, with the cumulative number of factors on the second axis; 316 factors in 2012 if working papers are included] 52

1. Multiple Tests: Discussion However: § Independence among test statistics is still not dealt with. § The number of hidden tests seems too low. 53

1. Multiple Tests: A New Framework No skill. Expected return = 0% Skill. Expected return = 6% 54

1. Multiple Tests: Harvey, Liu and Zhu Approach § Allows for correlation among strategy returns § Allows for missing tests § Review of Financial Studies, 2016 55

1. Multiple Tests: Backtesting § Due to data mining, a common practice in evaluating backtests of trading strategies is to discount Sharpe ratios by 50% § The 50% haircut is only a rule of thumb; we develop an analytical way to determine the haircut 56

1. Multiple Tests: Backtesting Method § Suppose we observe a strategy with an attractive Sharpe Ratio. § This Sharpe Ratio directly implies a p-value (which roughly tells you the probability that your strategy is a fluke) § Suppose the p-value is 0. 01 which looks pretty good. 57

1. Multiple Tests: Backtesting Method § However, suppose you tried 10 strategies and picked the best one § The Bonferroni adjusted p-value is 10 x 0.01 = 0.10, which would not be deemed “significant” § Reverse engineer the 0.10 back to the “haircut” Sharpe Ratio 58
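The round trip from Sharpe ratio to p-value and back can be sketched as follows. The sketch assumes a positive annualized Sharpe ratio, a t-statistic equal to the Sharpe ratio times the square root of the number of years, a two-sided normal p-value, and a 10-year sample; none of these specifics appear on the slides, and the function names are illustrative. It uses plain Bonferroni only, not the correlation-aware adjustments.

```python
import math
from statistics import NormalDist

_NORM = NormalDist()

def sharpe_to_p_value(sharpe_annual, years):
    """Two-sided p-value implied by an annualized Sharpe ratio (normal null)."""
    t_stat = sharpe_annual * math.sqrt(years)
    return math.erfc(abs(t_stat) / math.sqrt(2.0))

def haircut_sharpe(sharpe_annual, years, n_tests):
    """Sharpe ratio whose single-test p-value equals the Bonferroni-adjusted one."""
    p_adj = min(1.0, n_tests * sharpe_to_p_value(sharpe_annual, years))
    if p_adj >= 1.0:
        return 0.0                      # nothing survives the adjustment
    if p_adj <= 0.0:
        return sharpe_annual            # p-value underflowed; no visible haircut
    t_adj = -_NORM.inv_cdf(p_adj / 2.0)
    return t_adj / math.sqrt(years)

years = 10                                              # assumed sample length
sharpe = -_NORM.inv_cdf(0.01 / 2.0) / math.sqrt(years)  # Sharpe with p = 0.01
adjusted = haircut_sharpe(sharpe, years, n_tests=10)
haircut = 1.0 - adjusted / sharpe
print(f"Sharpe {sharpe:.2f} -> haircut Sharpe {adjusted:.2f} ({haircut:.0%} haircut)")
```

Calling `haircut_sharpe` with different starting Sharpe ratios shows that the percentage haircut is not constant, in contrast to the 50% rule of thumb.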

1. Multiple Tests: Backtesting Results: Percentage Haircut is Non-Linear Journal of Portfolio Management 59

2. Bootstrapping Multiple testing approach has drawbacks § Need to know the number of tests § Need to know the correlation among the tests § With similar sample sizes, this approach does not impact the ordering of performance 60

2. Bootstrapping 100 candidate factors and 500 observations. § Step 1. Demean all factor returns (same as regressing on a vector of ones). The average factor return and t-stat are exactly equal to zero – we have enforced zero ability to explain the cross-section of expected returns. § Step 2. Bootstrap rows of the data to produce a new 500 x 100 sheet (note some rows are sampled more than once and some not sampled at all) 61


2. Bootstrapping § Step 3. Recalculate the factor average returns and t-stats on new data. Save the highest t-statistic (or any test statistic that you specify) from the 100 factors. Note, in the unbootstrapped data, every t-statistic is exactly zero. § Step 4. Repeat steps 2 and 3 10,000 times. § Step 5. Now that we have the empirical distribution of the max t-statistic under the null of no cross-sectional explanatory power, compare to the max t-statistic in real data. 63

2. Bootstrapping § Step 5a. If the max t-stat in the real data fails to exceed the threshold (95th percentile of the null distribution), stop (no factors are significant). § Step 5b. If the max t-stat in the real data exceeds the threshold, declare the factor, say, F7, “true” 64

2. Bootstrapping § Step 6. Regress each of the 99 remaining factors on the actual F7 and create adjusted factors by subtracting out the intercept from each regression. That is, for F1, run the time-series regression F1 = a1 + b1 F7 + e1. The adjusted F1 is F1* = F1 - a1 65

2. Bootstrapping § Step 7. Note that each of the 99 adjusted factors, F*, is exactly captured by F7, so they have zero incremental explanatory power for the cross-section of expected returns. That is, E[F1*] = b1 E[F7] 66

2. Bootstrapping § Step 8. Repeat Steps 2-5. If an adjusted factor exceeds the 95% threshold, declare the factor true; if not, stop. § Step 9. Continue until Step 8 tells you to stop. Note: There is a difference between fund managers and factors. For fund managers, we don't need to adjust the rest of the fund returns with respect to the max fund had the max been found significant. For factors, we do need the adjustment because we want to assess incremental contribution. 67
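The first round of the procedure (Steps 1 through 5) and the intercept adjustment of Step 6 can be sketched with NumPy. This is a simplified illustration, not the authors' implementation: the data are simulated noise, only 1,000 bootstrap draws are used instead of 10,000 to keep it fast, the function names are hypothetical, and the later rounds (Steps 8 and 9) are not covered.

```python
import numpy as np

def t_stats(x):
    """Column-wise t-statistics of the mean return."""
    n = x.shape[0]
    return x.mean(axis=0) / (x.std(axis=0, ddof=1) / np.sqrt(n))

def max_t_null(factors, n_boot=1000, seed=0):
    """Steps 1-4: bootstrap distribution of the max t-stat under the null."""
    rng = np.random.default_rng(seed)
    demeaned = factors - factors.mean(axis=0)      # Step 1: enforce zero means
    n = demeaned.shape[0]
    max_t = np.empty(n_boot)
    for b in range(n_boot):                        # Step 4: repeat
        rows = rng.integers(0, n, size=n)          # Step 2: resample whole rows
        max_t[b] = t_stats(demeaned[rows]).max()   # Step 3: keep the max t-stat
    return max_t

def remove_intercept(f, base):
    """Step 6: regress f on base with an intercept, subtract the intercept."""
    X = np.column_stack([np.ones_like(base), base])
    (a, b), *_ = np.linalg.lstsq(X, f, rcond=None)
    return f - a

# Simulated placeholder data: 500 observations of 100 pure-noise factors.
rng = np.random.default_rng(1)
factors = rng.normal(0.0, 0.04, size=(500, 100))

null_max = max_t_null(factors)
threshold = np.quantile(null_max, 0.95)            # Step 5: 95th percentile
observed = t_stats(factors)
best = int(observed.argmax())
print(f"threshold = {threshold:.2f}, max t-stat = {observed[best]:.2f} (factor {best})")
print("declare factor true" if observed[best] > threshold else "stop: no factor is significant")

# Step 6 applied to one remaining factor, as if factor `best` had been selected.
other = 0 if best != 0 else 1
adjusted = remove_intercept(factors[:, other], factors[:, best])
```

Because whole rows are resampled, the contemporaneous correlation among factors is preserved in every bootstrap sample.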

2. Bootstrapping [Flowchart: baseline model plus candidate factors; if a candidate is significant (Yes), move to the augmented model and repeat; if not (No), terminate to arrive at the final model] 68

2. Bootstrapping 69

2. Bootstrapping §Addresses data mining directly §Allows for cross-correlation of the factors because we are bootstrapping rows of data §Allows for non-normality in the data (no distributional assumptions imposed – we are resampling the original data) §Potentially allows for time-dependence in the data by changing to a block bootstrap. §Answers the questions: §How many factors? §Which factors are just lucky? 70

3. Noise reduction: Rethinking Issue § Past alphas do a poor job of predicting future alphas (e.g., top quartile managers are about as likely to be in top quartile next year as this year’s bottom quartile managers!) § Same for smart beta: 71

3. Noise reduction: Rethinking Issue § This could be because all betas are false betas – or it could be a result of a lot of noise in historical performance 72

3. Noise reduction: Rethinking Goal § Develop a metric that maximizes cross-sectional predictability of performance § Useful for separating “smart” vs. “not-smart” § Maybe the “Siren Song” story is wrong and there is some ability to time factors 73

3. Noise reduction: Rethinking Observed performance consists of three components: • True factor premia • Unmeasured risk (e.g., low vol strategy having negative convexity) • Noise (good or bad luck) 74

3. Noise reduction: Rethinking Method: • Let’s start with the one factor that people generally agree with: a market return. Factor premium implied by theory and has a long history of tests. • Measure all factors relative to the market (this strips out “market beta” from all alt betas) • Estimate beta and alpha • Note the method is even easier to apply if we look at the raw performance (no beta adjustment) 75

3. Noise reduction: Rethinking Intuition § But alpha is overfit. Regression maximizes the time-series R² for a particular factor. § This time-series regression has nothing to do with cross-sectional predictability. § All of the noise will be put in the alpha. § No surprise that past alphas have no ability to forecast future alphas 76

3. Noise reduction: Rethinking Our approach § We follow the machine learning literature and “regularize” the problem by imposing a parametric distribution on the cross-section of alphas. § Leads to lower time-series R² – but higher cross-sectional R² 77

3. Noise reduction: Rethinking • t-stat = 3.9%/4.0% = 0.98 < 2.0 • alpha = 0 cannot be ruled out 78

3. Noise reduction: Rethinking • Both t-stats < 2.0 • alpha = 0 cannot be rejected for either 79

3. Noise reduction: Rethinking • t-stat < 2.0 for all factors • alpha = 0 cannot be excluded for any of them • However, the estimated alphas cluster around 4.0%. Should we declare all alphas as zero? 80

3. Noise reduction: Rethinking • Although no individual factor has a statistically significant alpha, the population mean seems to be well estimated at 4.0%. 81

3. Noise reduction: Rethinking Method • Assume the alpha distribution is a mixture of normals (which is very general) • Implement the iterative EM algorithm that balances the dual goals of time-series R² and cross-sectional R² • If the time-series regression produces a good fit, then more weight is put on the time-series for a particular fund • If there is a short sample and not much information in the time-series, more weight is put on drawing from the cross-sectional distribution 82
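The weighting logic in the last two bullets can be illustrated with a much simpler stand-in than the EM algorithm: a one-step shrinkage toward the cross-sectional mean under a single normal prior. This is a swapped-in simplification, not the mixture-of-normals method described above; the function name and the alpha and standard-error values are hypothetical, and standard errors are assumed to be positive.

```python
import numpy as np

def shrink_alphas(alpha_hat, se):
    """Precision-weighted shrinkage of OLS alphas toward the cross-sectional mean."""
    alpha_hat = np.asarray(alpha_hat, dtype=float)
    se = np.asarray(se, dtype=float)
    mu = alpha_hat.mean()                                    # population mean
    # Cross-sectional variance of true alphas (method of moments, floored at 0).
    tau2 = max(0.0, alpha_hat.var(ddof=1) - np.mean(se ** 2))
    weight = tau2 / (tau2 + se ** 2)                         # weight on own time series
    return weight * alpha_hat + (1.0 - weight) * mu, weight

# Hypothetical OLS alphas and standard errors, in % per year.
alpha_hat = [10.0, 10.0, -2.0, 4.0, 1.0, 1.0]
se = [4.0, 1.0, 1.0, 2.0, 1.0, 4.0]

shrunk, weight = shrink_alphas(alpha_hat, se)
for a, s, w, z in zip(alpha_hat, se, weight, shrunk):
    print(f"alpha_hat={a:5.1f}  se={s:3.1f}  weight={w:.2f}  shrunk={z:5.2f}")
```

The first two funds have the same estimated alpha, but the one with the larger standard error is pulled much closer to the population mean, which mirrors the intuition that noisy time series get less weight.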

3. Noise reduction: Rethinking 83

3. Noise reduction: Rethinking • In-sample: 1984-2001; Out-of-sample: 2002-2011 (application to fund performance)
Range          NRA forecast error (%)*   OLS forecast error (%)*   # of funds
(-∞, -2.0)     3.29                      6.61                      64
[-2.0, -1.5)   3.09                      3.70                      75
[-1.5, 0)      2.75                      2.92                      565
[0, 1.5)       2.61                      5.54                      610
[1.5, 2.0)     2.38                      10.47                     87
[2.0, +∞)      2.77                      12.02                     87
Overall        2.71                      5.17                      1,488
*Mean absolute forecast errors. 84

Final perspectives § Combination of propensity for Type I errors, incorrect testing methods, and lack of effort to reduce noise implies: § Most published empirical research findings are likely false § Most of the smart beta products are not “smart” § No predictability in performance § My research makes progress on the goal of identifying repeatable performance § There is a host of other issues: § Factor loadings are also noisy § Ex-post factor loadings unfairly punish market timers § It is essential to look beyond the Sharpe Ratio and incorporate other information 85

Credits Based on joint work with Yan Liu, Texas A&M University § “… and the Cross-section of Expected Returns” http://ssrn.com/abstract=2249314 [Best paper in investment, WFA 2014] § “Backtesting” http://ssrn.com/abstract=2345489 [Bernstein Fabozzi/Jacobs-Levy best paper, JPM 2015] § “Evaluating Trading Strategies” http://ssrn.com/abstract=2474755 [Bernstein Fabozzi/Jacobs-Levy best paper, JPM 2014] § “Lucky Factors” http://ssrn.com/abstract=2528780 § “Rethinking Performance Evaluation” http://ssrn.com/abstract=2691658 86