QUANTILE REGRESSION AND WINSORIZING ISQS 5349 * March 24, 2015 Corey Collins Kimberly Tribou


Discussion Outline
§ Define outliers.
§ Strategies for dealing with outliers.
§ Demonstrate winsorization using R.
§ Demonstrate quantile regression using R.

WHAT ARE OUTLIERS?
“The traditional […] way of looking at the world begins by focusing on the ordinary, and then deals with exceptions or so-called outliers as ancillaries. But there is a second way, which takes the exceptional as a starting point and treats the ordinary as subordinate.”
--Nassim Nicholas Taleb, The Black Swan

Definition of an outlier
“In a sample of N observations, it is possible for a limited number to be so far separated in value from the remainder that they give rise to the question whether they are not from a different population, or that the sampling technique is at fault.”
--The Oxford Dictionary of Statistical Terms (cited in Finney 2006)

Potential Methods for Addressing Outliers

Examples of Outliers
§ Data entry error
§ Missing observations
§ Zero or “extravagantly large” values believed to be unrelated to the experimental treatment.
§ Correct measurement distorted by chance or extreme error.
§ Spasmodic malfunction of a measuring instrument.
§ Correct measurement incorrectly fitted to a model

Strategies
§ If practical, correct observations and re-compute estimated models.
§ Incorporate outliers in analysis, commenting on outliers as part of final interpretation.
§ Confirm potential outliers through replication of study.
§ Replace outliers with new measurements, “collapsing” them in with other data.
§ Remove (replace) outliers and analyze remaining observations as though the outliers had never existed.
§ Do nothing, recognizing that extreme outliers may signal an interesting or important phenomenon.
(Source: Finney 2006)

WINSORIZING

Winsorizing
§ Objective: to diminish the effect of the outlier (Yale and Forsythe 1976).
§ Method: redefine the most extreme values (possible outliers) as the next most extreme values (Yale and Forsythe 1976).
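The method above can be sketched in a few lines of base R. This is only an illustration: the function name winsorize_k, the choice of k = 1 value per tail, and the sample values are all assumptions, not taken from the slides.

```r
# Winsorize by order statistics: the k smallest values are set to the
# (k+1)th smallest value, and the k largest to the (k+1)th largest.
# winsorize_k is a hypothetical helper name.
winsorize_k <- function(y, k = 1) {
  stopifnot(k >= 0, 2 * k < length(y))
  s <- sort(y)
  n <- length(y)
  lower <- s[k + 1]   # next most extreme value in the lower tail
  upper <- s[n - k]   # next most extreme value in the upper tail
  pmin(pmax(y, lower), upper)
}

# Hypothetical sample with one extreme value in each tail.
y <- c(2, 15, 16, 17, 18, 19, 20, 21, 22, 60)
y_w <- winsorize_k(y, k = 1)

print(y_w)
print(c(mean_raw = mean(y), mean_winsorized = mean(y_w)))
```

The sample size is unchanged; only the two extreme values move inward, which also shifts the mean.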

Applying Winsorizing
Let:
§ yi be the ith observation from a sample of n (x, y) points
§ ŷi be some estimate of yi, given xi
§ The residual, di, equals yi - ŷi
The winsorized regression line is the least squares fit to a treated sample of n (x, y’) points, where y’i = ŷi + d’i and d’i is the winsorized residual: di itself, unless di is among the most extreme residuals, in which case it is replaced by the next most extreme residual.
(Source: Yale and Forsythe 1976)

Winsorization Demonstration in R
§ Install and load the R “psych” package
  § library(psych)
§ Fit the OLS model and apply the winsor() command to its residuals
  § lm(y ~ x)
  § winsor(x, trim, na.rm)
§ Add the winsorized residuals to the fitted values
  § y’ = fitted( ) + winsor( )
§ plot(y’ ~ x)
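A minimal sketch of these steps follows, using simulated data. Everything about the data is an assumption (the seed, the sample size, the coefficients, and the heavy-tailed noise), as is the 10% trim level. To keep the sketch free of package dependencies, the residuals are clamped with base R's quantile(), pmin() and pmax(); psych::winsor(d, trim = 0.10) is intended to do a similar quantile-based clamp.

```r
set.seed(5349)

# Hypothetical data: a linear trend plus heavy-tailed noise, so that a few
# observations fall far from the line. All parameter values are assumptions.
n <- 200
x <- runif(n, 1, 30)
y <- 55 - 0.1 * x + rt(n, df = 2)

# Step 1: ordinary least squares fit.
ols <- lm(y ~ x)

# Step 2: winsorize the residuals by clamping them to their 10th and 90th
# percentiles (an assumed trim level).
d <- resid(ols)
cut <- quantile(d, probs = c(0.10, 0.90))
d_w <- pmin(pmax(d, cut[1]), cut[2])

# Step 3: rebuild the response as fitted value plus winsorized residual.
y_w <- fitted(ols) + d_w

# Step 4: refit by least squares on the treated sample.
wols <- lm(y_w ~ x)

print(confint(ols))
print(confint(wols))
```

The interval from the treated sample is narrower because the treated responses vary less around the line. It describes the altered data, not the original observations.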

“Audit Lag” Simulation Using R Audit lag: the time between when a company’s fiscal year ends and when the audited financial statements are released. § As an auditor’s client tenure increases, we would expect that the audit lag generally decreases. § True function

OLS versus Winsorized Scatterplot: Audit Lag Simulation

OLS versus Winsorized Regression Fit: Audit Lag Simulation
OLS estimate, confidence intervals:
§ 53.196 < β0 < 57.483
§ -0.1599 < β1 < -0.0063
Winsorized OLS estimate, confidence intervals:
§ 54.852 < β0 < 55.20495
§ -0.1082 < β1 < -0.08085

Audit Fee Demonstration
§ Data Source: WRDS Audit Analytics File
§ Industry: Casual Apparel
§ X variable: Logarithmically transformed Total Book Value (as of FYE)
§ Y variable: Logarithmically transformed Total Audit Fees

Advantages of Winsorization
§ Winsorization provides greater precision than OLS when the data are riddled with outliers.
§ Alternative estimation procedures (e.g., maximum likelihood) may require more difficult computation (Winsor, 1946).
§ Treating the outlying values may improve generalizability.
§ In some settings (e.g., accounting), winsorizing is customary.

Disadvantages of Winsorization
§ Winsorizing materially alters the content of the data and the appearance of related scatterplot diagrams.
§ Winsorization can mask outliers that indicate irregularities that need to be addressed.
§ Winsorization alters the mean by replacing the extreme tail values (myweb.ttu.edu/pwestfall/ISWS5349/OUTLIERS.pdf).
§ Removing or replacing data observations can have ethical ramifications (Finney, 2006).

QUANTILE REGRESSION

Quantile Regression
§ The name “quantile” simply refers to the separation of the observations into equal parts: percentiles (100), deciles (10), quintiles (5), quartiles (4), and so on.
§ Example: On the GMAT, students are given a percentile ranking.
§ If the student scores…
  § Higher than 90% of the other students
  § Lower than the other 10% of students
… they are at the 90th percentile, or the 0.90 quantile, of the distribution.
(Koenker and Hallock, 2001)
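As a small illustration of the percentile idea, the sketch below computes the 0.90 quantile of a set of simulated test scores with base R's quantile(). The scores are hypothetical (the mean, spread, and sample size are assumptions) and are not GMAT data.

```r
set.seed(1)

# Hypothetical test scores; the distribution is an assumption.
scores <- round(rnorm(1000, mean = 550, sd = 100))

# The 0.90 quantile: the score at the 90th percentile.
q90 <- quantile(scores, probs = 0.90)

# Share of students scoring at or below that value (close to 0.90).
share_below <- mean(scores <= q90)

print(q90)
print(share_below)
```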

Quantile Regression vs. Ordinary Least Squares
§ OLS models the conditional mean and minimizes the sum of squared residuals.
§ Quantile regression at the median minimizes the sum of absolute residuals.
§ Quantile regression at the τth quantile (other than the median) minimizes the sum of asymmetrically weighted absolute residuals, with weights:
  § Above the regression line (positive residuals): τ
  § Below the regression line (negative residuals): 1 - τ
(Koenker and Hallock, 2001)
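The asymmetric weighting can be checked numerically in the simplest case, a model with no predictors, where the value minimizing the weighted absolute residuals should be the sample quantile itself. The sketch below uses base R only; check_loss and best_fit are hypothetical helper names, and the skewed sample is simulated.

```r
# Asymmetrically weighted absolute residuals: positive residuals get
# weight tau, negative residuals get weight 1 - tau.
check_loss <- function(u, tau) {
  sum(u * (tau - (u < 0)))
}

set.seed(2)
y <- rexp(501)  # hypothetical skewed sample

# Try every observed value as the candidate fit and keep the one with
# the smallest loss.
best_fit <- function(tau) {
  losses <- vapply(y, function(q) check_loss(y - q, tau), numeric(1))
  y[which.min(losses)]
}

# tau = 0.50 recovers the sample median; tau = 0.90 recovers the
# 0.90 sample quantile (type = 1 is the inverse empirical CDF).
print(c(best_fit(0.50), median(y)))
print(c(best_fit(0.90), quantile(y, 0.90, type = 1)))
```

Quantile regression applies the same loss to residuals from a line rather than from a single constant.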

Conditional Distribution of p(y|x)
[Figure: conditional distribution of y given x, with τ = 0.75, τ = 0.50, and τ = 0.25 marked. Adapted from Fitzenberger, 2012]

Quantile Regression § This is all about the conditional distribution, p(y|x). The observations we see were produced by an unknown data generating process and the outliers still provide information useful to understanding the process.

Quantile Regression
§ So what does this mean?
§ OLS Regression
  § E(Y|X=x) = x’β
  § This form of regression will return a single slope/rate of change (β)
§ Quantile Regression
  § qτ(Y|X=x) = x’β(τ)
  § This form of regression will return different slopes/rates of change (βs) for different quantiles of the response variable (Y) distribution.
(Hao and Naiman, 2007)

Assumptions
§ OLS…
  § Assumes normality
  § Assumes constant variance (homoscedasticity)
  § Assumes linearity: the mean is a linear function of X
  § Assumes uncorrelated errors, but adjustments are available
§ Quantile Regression…
  § Does not assume anything about the distribution
  § Does not assume constant variance (heteroscedasticity is allowed)
  § Assumes linearity: the quantile is a linear function of X
  § Assumes uncorrelated errors, but adjustments are available
(Cade and Noon, 2003)

Quantile Regression – Why does this matter?
§ Recall: An estimated 55% of accounting research utilizes winsorizing.
§ Quantile regression can be used to mitigate the effects of outliers and to see how they are weighted throughout the sample, without altering the observations or removing the outliers from the data.
§ Limitations
  § The dependent variable in the quantile regression must be practically continuous (non-discrete).
  § The sample data set must be large enough to provide enough information at each quantile.
(Leone et al., 2013, working paper)

Quantile Regression in R
§ The Quantile Regression function requires the “quantreg” package in R
§ Demonstrations:
  § Simulation
  § Audit Fees Data
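A minimal sketch of a simulation of this kind, assuming the quantreg package is installed, might look as follows. The data-generating process is hypothetical: the seed, sample size, coefficients, and the noise whose spread grows with x are all assumptions chosen so that the quantile slopes differ. rq() is the fitting function provided by quantreg.

```r
library(quantreg)  # provides rq()

set.seed(3)

# Hypothetical heteroscedastic process: the noise spread grows with x,
# so the upper and lower quantiles fan out as x increases.
n <- 2000
x <- runif(n, 0, 10)
y <- 1 + 2 * x + (0.5 + 0.5 * x) * rnorm(n)

# OLS returns one slope for the conditional mean.
fit_ols <- lm(y ~ x)

# Quantile regression returns one slope per requested quantile.
fit_rq <- rq(y ~ x, tau = c(0.25, 0.50, 0.75))

print(coef(fit_ols))
print(coef(fit_rq))

# Slopes by quantile, one column per tau.
slopes <- coef(fit_rq)["x", ]
```

Under this assumed process the slope rises with τ, while OLS reports a single slope for the mean.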

Summary
§ Winsorization
  § Redefines the most extreme values to something less extreme
  § Systematically biases the distribution over or under the true regression line
§ Quantile Regression
  § Expects outliers and uses them as part of the analysis
  § Allows analysis of different portions of the data without altering or removing data produced by the true process

References (Winsorizing)
Chen, E. H. and Dixon, W. J. (1972), “Estimates of Parameters of a Censored Regression Sample,” Journal of the American Statistical Association, 67, 664-671.
Finney, D. J. (2006), “Calibration Guidelines Challenge Outlier Practices,” The American Statistician, 60, 309-314.
Leone, A. J., Minutti-Meza, M., and Wasley, C. (2013), “Influential Observations and Inference in Accounting Research,” (working paper).
Kennedy, D., Lakonishok, J., and Shaw, W. H. (1992), “Accommodating Outliers and Nonlinearity in Decision Models,” Journal of Accounting, Auditing, and Finance, 7, 161-190.
Moussa-Hamouda, E. and Leone, F. C. (1977), “Efficiency of Ordinary Least Squares Estimators from Trimmed and Winsorized Samples in Linear Regression,” Technometrics, 19, 265-273.
Westfall, P. H. and Henning, K. S. S. (2013), Understanding Advanced Statistical Methods. Chapman and Hall/CRC.
Winsor, C. P. (1946), “Which Regression?,” Biometrics Bulletin, 2, 101-109.

References (Quantile Regression)
Blatna, D. (2006-03). Outliers in regression. Trutnov, Vol. 30.
Cade, B. S., & Noon, B. R. (2003). A gentle introduction to quantile regression for ecologists. Frontiers in Ecology and the Environment, 1(8), 412-420. http://www.fort.usgs.gov/products/publications/21137.pdf
Fitzenberger, Bernd (2012). Quantile Regression. Universität Linz. http://www.econ.jku.at/members%5CDerntl%5Cfiles%5CPHD%5CFitzenberger_Quantile.Regression.pdf
Hao, L., & Naiman, D. Q. (2007). Quantile regression (No. 149). Sage. http://www.sagepub.com/upm-data/14855_Chapter3.pdf
Koenker, R. (2015). Quantile Regression in R: A Vignette. Working Paper.
Koenker, R. and Hallock, K. F. (2001). Quantile Regression. Journal of Economic Perspectives, 15(4), 143-156.