Chapter 4 Prediction and Bayesian Inference

• 4.1 Estimators versus predictors
• 4.2 Prediction for one-way ANOVA models
  – Shrinkage estimation, types of predictions
• 4.3 Best linear unbiased predictors (BLUPs)
• 4.4 Mixed model predictors
• 4.5 Bayesian inference
• 4.6 Case study: Forecasting lottery sales
• 4.7 Credibility Theory
• Appendix 4A Linear unbiased predictors

4.1 Estimators versus predictors
• In the longitudinal data model, $y_{it} = \mathbf{z}_{it}'\boldsymbol{\alpha}_i + \mathbf{x}_{it}'\boldsymbol{\beta} + \varepsilon_{it}$, the variables $\{\boldsymbol{\alpha}_i\}$ describe subject-specific effects.
• Given the data $\{y_{it}, \mathbf{z}_{it}, \mathbf{x}_{it}\}$, in some problems it is of interest to “summarize” subject effects.
  – We have discussed how to estimate fixed, unknown parameters.
  – It is also of interest to summarize subject-specific effects, such as those described by the random variable $\boldsymbol{\alpha}_i$.
• Predictors are “estimators” of random variables.
  – Like estimators, predictors are said to be linear if they are formed from a linear combination of the responses $\mathbf{y}$.

Applications of prediction
• In animal and plant breeding, one wishes to predict the production of milk for cows based on (1) their lineage (random) and (2) herds (fixed).
• In credibility theory, one wishes to predict expected claims for a policyholder given exposure to several risk factors.
• In sample surveys, one wishes to predict the size of a specific age-sex-race cohort within a small geographical area (known as “small area estimation”).
• In a survey article, Robinson (1991) also cites (1) ore reserve estimation in geological surveys, (2) measuring quality of a production plan, and (3) ranking baseball players’ abilities.

4.2 Prediction for one-way ANOVA models
• Consider the traditional one-way random effects ANOVA (analysis of variance) model: $y_{it} = \mu + \alpha_i + \varepsilon_{it}$.
  – Suppose that we wish to summarize the subject-specific conditional mean, $\mu + \alpha_i$.
• For contrast, first consider using the fixed effects model with $\mu = 0$.
  – Here, the subject average $\bar{y}_i = T_i^{-1}\sum_{t=1}^{T_i} y_{it}$ is the “best” (Gauss-Markov) estimate of $\alpha_i$.
  – This estimate is unbiased, that is, $\mathrm{E}\,\bar{y}_i = \alpha_i$.
  – This estimate has minimum variance among all linear unbiased estimators (BLUE).

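Shrinkage estimator
• Using the one-way random effects model, with $\sigma_\alpha^2 = \mathrm{Var}\,\alpha_i$ and $\sigma^2 = \mathrm{Var}\,\varepsilon_{it}$:
  – Consider an “estimator” of $\mu + \alpha_i$ that is a linear combination of $\bar{y}_i$ and the overall average $\bar{y}$, that is, $c_1\bar{y}_i + c_2\bar{y}$, for constants $c_1$ and $c_2$.
• Calculations show that the best values of $c_1$ and $c_2$, those that minimize $\mathrm{E}\,(c_1\bar{y}_i + c_2\bar{y} - (\mu + \alpha_i))^2$, satisfy $c_2 = 1 - c_1$, with $c_1$ determined by the variance components and the sample sizes.
• For large $n$, we have the shrinkage estimator, or predictor, of $\mu + \alpha_i$ to be $\bar{y}_{i,s} = \zeta_i\bar{y}_i + (1 - \zeta_i)\bar{y}$, where $\zeta_i = T_i/(T_i + \sigma^2/\sigma_\alpha^2)$.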

Example of shrinkage estimator
Hypothetical Run Times for Three Machines
  Machine   Run Times        Average Run Time
  1         14, 12, 10, 12   $\bar{y}_1 = 12$
  2         9, 16, 15, 12    $\bar{y}_2 = 13$
  3         8, 10, 7, 7      $\bar{y}_3 = 8$
  – Notation: $y_{ij}$ means the $j$th run from the $i$th machine.
  – For example, $y_{21} = 9$ and $y_{23} = 15$.
• Are there real differences among machines?

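Example - Continued
• To see the “shrinkage” effect, compare each machine’s average with its shrinkage estimator. The overall average is $\bar{y} = 11$.
  – Machine 3: 8 → 8.525
  – Machine 1: 12 → 11.825
  – Machine 2: 13 → 12.650
• Figure 4.1 Comparison of Subject-Specific Means to Shrinkage Estimators.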
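The shrinkage values in this example can be reproduced, up to rounding, with a short calculation. The sketch below is not from the slides: it assumes the variance components are estimated by the usual ANOVA (method-of-moments) estimators and that the weight is $\zeta = T/(T + \sigma^2/\sigma_\alpha^2)$ for the balanced case $T_i = T$. All variable names are illustrative.

```python
# Run times from the example table (machine -> observations).
runs = {1: [14, 12, 10, 12], 2: [9, 16, 15, 12], 3: [8, 10, 7, 7]}

n = len(runs)            # number of subjects (machines)
T = len(runs[1])         # balanced design: T runs per machine
means = {i: sum(y) / T for i, y in runs.items()}
grand_mean = sum(means.values()) / n

# ANOVA (method-of-moments) variance component estimates.
# ASSUMPTION: the slides do not say how the variance components were estimated.
mse = sum((v - means[i]) ** 2 for i, y in runs.items() for v in y) / (n * (T - 1))
msb = T * sum((m - grand_mean) ** 2 for m in means.values()) / (n - 1)
sigma2 = mse                      # estimate of Var(epsilon_it)
sigma2_alpha = (msb - mse) / T    # estimate of Var(alpha_i)

# Shrinkage weight and shrinkage estimators.
zeta = T / (T + sigma2 / sigma2_alpha)
shrunk = {i: zeta * m + (1 - zeta) * grand_mean for i, m in means.items()}

print("zeta =", round(zeta, 4))
for i in sorted(shrunk):
    print("machine", i, "mean", means[i], "shrinkage estimator", round(shrunk[i], 3))
```

Under these assumptions the weight is close to 0.825, and the three estimators agree with the figure's values to within about 0.001; the small gaps are consistent with the figure having used a rounded weight.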

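More on shrinkage estimators
• Under the random effects model, $\bar{y}_i$ is an unbiased predictor of $\mu + \alpha_i$ in the sense that $\mathrm{E}\,(\bar{y}_i - (\mu + \alpha_i)) = 0$.
  – However, $\bar{y}_i$ is inefficient in the sense that $\bar{y}_{i,s}$ has a smaller mean square error than $\bar{y}_i$.
  – Here, $\bar{y}_{i,s}$ has been “shrunk” towards the stable estimator $\bar{y}$.
  – The “estimator” $\bar{y}_{i,s}$ is said to “borrow strength” from the stable estimator $\bar{y}$.
• Recall $\zeta_i = T_i/(T_i + \sigma^2/\sigma_\alpha^2)$.
• Note that $\zeta_i \to 1$ as either (i) $T_i \to \infty$ or (ii) $\sigma_\alpha^2/\sigma^2 \to \infty$.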

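Best predictors
• From Section 3.1, it is easy to check that the generalized least squares estimator of $\mu$ is $m_{\alpha,GLS} = \sum_{i=1}^n \zeta_i\bar{y}_i \big/ \sum_{i=1}^n \zeta_i$.
• The linear predictor of $\mu + \alpha_i$ that has minimum variance is $\bar{y}_{i,BLUP} = \zeta_i\bar{y}_i + (1 - \zeta_i)\,m_{\alpha,GLS}$.
  – Here, the acronym BLUP stands for best linear unbiased predictor.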

Types of Predictors
• We have now introduced the BLUP of $\mu + \alpha_i$. This quantity is a linear combination of global parameters and subject-specific effects.
• Two other types of predictors are of interest.
  – Residuals. Here, we wish to “predict” $\varepsilon_{it}$. The BLUP residual turns out to be $e_{it,BLUP} = y_{it} - \bar{y}_{i,BLUP}$.
  – Forecasts. Here, we wish to predict, for $L$ lead time units into the future, $y_{i,T_i+L} = \mu + \alpha_i + \varepsilon_{i,T_i+L}$.
  – Without serial correlation, the predictor is the same as the predictor of $\mu + \alpha_i$. However, we will see that the mean square error turns out to be larger.

4.3 Best linear unbiased predictors
• This section develops best linear unbiased predictors in the context of mixed linear models, then specializes the consideration to longitudinal data mixed models.
• BLUPs are developed by examining the minimum mean square error predictor of a random variable, $w$.
  – We give a development due to Harville (1976).
  – The argument is originally due to Goldberger (1962), who coined the phrase best linear unbiased predictor.
  – The acronym was first used by Henderson (1973).
• BLUPs can also be developed as conditional expectations using multivariate normality.
• BLUPs can also be developed in a Bayesian context.

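Mixed linear models
• Suppose that we observe an $N \times 1$ random vector $\mathbf{y}$ with mean $\mathrm{E}\,\mathbf{y} = \mathbf{X}\boldsymbol{\beta}$ and variance $\mathrm{Var}\,\mathbf{y} = \mathbf{V}$.
  – We wish to predict a random variable $w$ that has mean $\mathrm{E}\,w = \boldsymbol{\lambda}'\boldsymbol{\beta}$ and variance $\mathrm{Var}\,w = \sigma_w^2$.
  – Denote the covariance between $w$ and $\mathbf{y}$ as the $1 \times N$ vector $\mathrm{Cov}(w, \mathbf{y}) = \mathbf{cov}_{wy}$.
• Assuming known regression parameters ($\boldsymbol{\beta}$), the best linear (in $\mathbf{y}$) predictor of $w$ is
  $$w^* = \mathrm{E}\,w + \mathbf{cov}_{wy}\mathbf{V}^{-1}(\mathbf{y} - \mathrm{E}\,\mathbf{y}) = \boldsymbol{\lambda}'\boldsymbol{\beta} + \mathbf{cov}_{wy}\mathbf{V}^{-1}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}).$$
  – If $w$ and $\mathbf{y}$ are multivariate normal, then $w^*$ equals $\mathrm{E}(w \mid \mathbf{y})$ and hence is a minimum mean square predictor of $w$.
  – Without the assumption of normality, $w^*$ is still the minimum mean square predictor of $w$ among predictors that are linear in $\mathbf{y}$. See Appendix 4A.1.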

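BLUPs as predictors
• To develop the BLUP,
  – define $\mathbf{b}_{GLS} = (\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}'\mathbf{V}^{-1}\mathbf{y}$ to be the generalized least squares (GLS) estimator of $\boldsymbol{\beta}$.
  – This is the best linear unbiased estimator (BLUE).
  – Replace $\boldsymbol{\beta}$ by $\mathbf{b}_{GLS}$ in the definition of $w^*$ to get the BLUP
    $$w_{BLUP} = \boldsymbol{\lambda}'\mathbf{b}_{GLS} + \mathbf{cov}_{wy}\mathbf{V}^{-1}(\mathbf{y} - \mathbf{X}\mathbf{b}_{GLS}) = (\boldsymbol{\lambda}' - \mathbf{cov}_{wy}\mathbf{V}^{-1}\mathbf{X})\,\mathbf{b}_{GLS} + \mathbf{cov}_{wy}\mathbf{V}^{-1}\mathbf{y}.$$
  – See Appendix 4A.2 for a check, establishing $w_{BLUP}$ as the best linear unbiased predictor of $w$.
• From Appendix 4A.3, we also have the form for the minimum mean square error, where $\mathbf{cov}_{wy} = \mathrm{Cov}(w, \mathbf{y})$ is $1 \times N$:
  $$\mathrm{Var}(w_{BLUP} - w) = (\boldsymbol{\lambda}' - \mathbf{cov}_{wy}\mathbf{V}^{-1}\mathbf{X})(\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}(\boldsymbol{\lambda}' - \mathbf{cov}_{wy}\mathbf{V}^{-1}\mathbf{X})' - \mathbf{cov}_{wy}\mathbf{V}^{-1}\mathbf{cov}_{wy}' + \sigma_w^2.$$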

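Example: One-way model
• Recall $y_{it} = \mu + \alpha_i + \varepsilon_{it}$.
  – Thus, $\mathbf{y}_i = \mathbf{1}_i(\mu + \alpha_i) + \boldsymbol{\varepsilon}_i$, where $\mathbf{1}_i$ is a $T_i \times 1$ vector of ones. Thus, $\mathbf{X}_i = \mathbf{1}_i$ and $\mathbf{V}_i = \sigma_\alpha^2\mathbf{1}_i\mathbf{1}_i' + \sigma^2\mathbf{I}_i$.
  – With this, and with $\mathbf{b}_{GLS} = m_{\alpha,GLS}$, we note that $\mathbf{V}_i^{-1}(\mathbf{y}_i - \mathbf{X}_i\mathbf{b}_{GLS}) = \sigma^{-2}\left[(\mathbf{y}_i - \mathbf{1}_i m_{\alpha,GLS}) - \zeta_i\mathbf{1}_i(\bar{y}_i - m_{\alpha,GLS})\right]$.
  – Thus, for predicting $w = \mu + \alpha_i$ we have $\lambda = 1$ and $\mathrm{Cov}(w, \mathbf{y}_j) = \sigma_\alpha^2\mathbf{1}_i'$ for $j = i$, and $\mathbf{0}$ otherwise. Thus,
    $$w_{BLUP} = m_{\alpha,GLS} + \sigma_\alpha^2\mathbf{1}_i'\mathbf{V}_i^{-1}(\mathbf{y}_i - \mathbf{1}_i m_{\alpha,GLS}) = \zeta_i\bar{y}_i + (1 - \zeta_i)\,m_{\alpha,GLS} = \bar{y}_{i,BLUP}.$$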
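As a numerical check of this reduction, the sketch below evaluates the general formula $w_{BLUP} = \lambda' b_{GLS} + \mathrm{cov}_{wy} V^{-1}(y - X b_{GLS})$ for the one-way model and compares it with the closed form $\zeta_i \bar{y}_i + (1 - \zeta_i) m_{\alpha,GLS}$. The unbalanced data and the variance components are hypothetical and treated as known; `solve`, `blup_general`, and `blup_shrinkage` are illustrative helper names, not part of any library.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[k]] for k, row in enumerate(A)]   # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

# HYPOTHETICAL unbalanced data and variance components (assumed known).
data = [[14.0, 12.0, 10.0, 12.0], [9.0, 16.0], [8.0, 10.0, 7.0]]
sigma2, sigma2_alpha = 4.0, 6.0

y = [v for yi in data for v in yi]                       # stacked responses
subject = [i for i, yi in enumerate(data) for _ in yi]   # subject of each row
N = len(y)

# V is block diagonal with blocks sigma2_alpha * 1 1' + sigma2 * I.
V = [[(sigma2_alpha if subject[r] == subject[s] else 0.0)
      + (sigma2 if r == s else 0.0) for s in range(N)] for r in range(N)]

# GLS estimator with X a column of ones: (X'V^{-1}X)^{-1} X'V^{-1} y.
Vinv_1 = solve(V, [1.0] * N)
Vinv_y = solve(V, y)
b_gls = sum(Vinv_y) / sum(Vinv_1)

def blup_general(i):
    """General BLUP of w = mu + alpha_i (lambda = 1, cov_wy = sigma2_alpha on subject i)."""
    cov_wy = [sigma2_alpha if subject[r] == i else 0.0 for r in range(N)]
    Vinv_resid = solve(V, [v - b_gls for v in y])
    return b_gls + sum(c * u for c, u in zip(cov_wy, Vinv_resid))

# Closed forms from the one-way model.
zetas = [len(yi) / (len(yi) + sigma2 / sigma2_alpha) for yi in data]
ybars = [sum(yi) / len(yi) for yi in data]
m_alpha_gls = sum(z * yb for z, yb in zip(zetas, ybars)) / sum(zetas)

def blup_shrinkage(i):
    """Closed form zeta_i * ybar_i + (1 - zeta_i) * m_alpha_GLS."""
    return zetas[i] * ybars[i] + (1 - zetas[i]) * m_alpha_gls

for i in range(len(data)):
    print(i + 1, round(blup_general(i), 6), round(blup_shrinkage(i), 6))
```

The two columns should coincide up to floating-point error, and `b_gls` should match the $\zeta$-weighted average `m_alpha_gls`. A hand-written solver is used only to keep the sketch dependency-free; a linear algebra library would normally replace it.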

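Random effects ANOVA model
• For predicting residuals $w = \varepsilon_{it}$, we have $\lambda = 0$ and $\mathrm{Cov}(w, y_{js}) = \sigma^2$ for the $i$th subject and $t$th time period ($j = i$, $s = t$), and 0 otherwise.
• Let $\mathbf{1}_{it}$ be a $T_i \times 1$ vector with a 1 in the $t$th position and 0 otherwise. Thus,
  $$e_{it,BLUP} = \sigma^2\mathbf{1}_{it}'\mathbf{V}_i^{-1}(\mathbf{y}_i - \mathbf{X}_i\mathbf{b}_{GLS}) = y_{it} - \bar{y}_{i,BLUP}$$
  is our BLUP residual.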

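4.4 Mixed model predictors
• Recall the longitudinal data mixed model $\mathbf{y}_i = \mathbf{Z}_i\boldsymbol{\alpha}_i + \mathbf{X}_i\boldsymbol{\beta} + \boldsymbol{\varepsilon}_i$.
• As described in Section 3.3, this is a special case of the mixed linear model. We use
  – $\mathbf{V} = \text{block diagonal}(\mathbf{V}_1, \ldots, \mathbf{V}_n)$, where $\mathbf{V}_i = \mathbf{Z}_i\mathbf{D}\mathbf{Z}_i' + \mathbf{R}_i$, and
  – $\mathbf{X} = (\mathbf{X}_1', \ldots, \mathbf{X}_n')'$.
• For BLUP calculations, note that $\mathbf{cov}_{wy} = (\mathrm{Cov}(w, \mathbf{y}_1), \ldots, \mathrm{Cov}(w, \mathbf{y}_n))$.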

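Longitudinal data mixed model BLUP
• Recall that the random variable $w$ has mean $\mathrm{E}\,w = \boldsymbol{\lambda}'\boldsymbol{\beta}$ and variance $\mathrm{Var}\,w = \sigma_w^2$.
• The BLUP is
  $$w_{BLUP} = \boldsymbol{\lambda}'\mathbf{b}_{GLS} + \sum_{i=1}^n \mathrm{Cov}(w, \mathbf{y}_i)\mathbf{V}_i^{-1}(\mathbf{y}_i - \mathbf{X}_i\mathbf{b}_{GLS}).$$
• The mean square error is
  $$\mathrm{Var}(w_{BLUP} - w) = \Big(\boldsymbol{\lambda}' - \sum_{i=1}^n \mathrm{Cov}(w, \mathbf{y}_i)\mathbf{V}_i^{-1}\mathbf{X}_i\Big)\Big(\sum_{i=1}^n \mathbf{X}_i'\mathbf{V}_i^{-1}\mathbf{X}_i\Big)^{-1}\Big(\boldsymbol{\lambda}' - \sum_{i=1}^n \mathrm{Cov}(w, \mathbf{y}_i)\mathbf{V}_i^{-1}\mathbf{X}_i\Big)' - \sum_{i=1}^n \mathrm{Cov}(w, \mathbf{y}_i)\mathbf{V}_i^{-1}\mathrm{Cov}(w, \mathbf{y}_i)' + \sigma_w^2.$$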

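BLUP special cases
• Global parameters and subject-specific effects.
  – Suppose that the interest is in predicting linear combinations of global parameters $\boldsymbol{\beta}$ and subject-specific effects $\boldsymbol{\alpha}_i$.
  – Consider linear combinations of the form $w = \mathbf{c}_1'\boldsymbol{\alpha}_i + \mathbf{c}_2'\boldsymbol{\beta}$.
• Residuals. Here, $w = \varepsilon_{it}$.
• Forecasts. Suppose that the $i$th subject is included in the data set; predict
  – $w = y_{i,T_i+L} = \mathbf{z}_{i,T_i+L}'\boldsymbol{\alpha}_i + \mathbf{x}_{i,T_i+L}'\boldsymbol{\beta} + \varepsilon_{i,T_i+L}$ for $L$ lead time units in the future.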

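Predicting global parameters and subject-specific effects
• Consider linear combinations of the form $w = \mathbf{c}_1'\boldsymbol{\alpha}_i + \mathbf{c}_2'\boldsymbol{\beta}$.
• Straightforward calculations show that
  – $\mathrm{E}\,w = \mathbf{c}_2'\boldsymbol{\beta}$, so that $\boldsymbol{\lambda} = \mathbf{c}_2$,
  – $\mathrm{Cov}(w, \mathbf{y}_j) = \mathbf{c}_1'\mathbf{D}\mathbf{Z}_i'$ for $j = i$, and
  – $\mathrm{Cov}(w, \mathbf{y}_j) = \mathbf{0}$ for $j \neq i$.
• Thus, $w_{BLUP} = \mathbf{c}_2'\mathbf{b}_{GLS} + \mathbf{c}_1'\mathbf{D}\mathbf{Z}_i'\mathbf{V}_i^{-1}(\mathbf{y}_i - \mathbf{X}_i\mathbf{b}_{GLS})$.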

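Special case 1
• Take $\mathbf{c}_2 = \mathbf{0}$. Because the mean and variance expressions are true for all vectors $\mathbf{c}_1$, we may write this in vector notation to get the BLUP of $\boldsymbol{\alpha}_i$, the vector $\mathbf{a}_{i,BLUP} = \mathbf{D}\mathbf{Z}_i'\mathbf{V}_i^{-1}(\mathbf{y}_i - \mathbf{X}_i\mathbf{b}_{GLS})$.
• This is unbiased in the sense that $\mathrm{E}\,(\mathbf{a}_{i,BLUP} - \boldsymbol{\alpha}_i) = \mathbf{0}$.
• This predictor has minimum variance among all linear unbiased predictors (BLUP).
• In the case of the error components model ($z_{it} = 1$), this reduces to $a_{i,BLUP} = \zeta_i(\bar{y}_i - \bar{\mathbf{x}}_i'\mathbf{b}_{GLS})$.
• For comparison, recall the fixed effects parameter estimate, $a_i = \bar{y}_i - \bar{\mathbf{x}}_i'\mathbf{b}$.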

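Motivating BLUPs
• We can also motivate BLUPs using normal theory:
  – Consider the case where $\boldsymbol{\alpha}_i$ and $\boldsymbol{\varepsilon}_i$ are multivariate normally distributed.
  – Then, it can be shown that $\mathrm{E}(\boldsymbol{\alpha}_i \mid \mathbf{y}_i) = \mathbf{D}\mathbf{Z}_i'\mathbf{V}_i^{-1}(\mathbf{y}_i - \mathbf{X}_i\boldsymbol{\beta})$.
  – To motivate this, consider asking the question: what realization of $\boldsymbol{\alpha}_i$ could be associated with $\mathbf{y}_i$? The expectation!
  – The BLUP is the BLUE of $\mathrm{E}(\boldsymbol{\alpha}_i \mid \mathbf{y}_i)$. (That is, replace $\boldsymbol{\beta}$ by $\mathbf{b}_{GLS}$.)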

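Special case 2
• As another example, it is of interest to predict a particular linear combination $w = \mathbf{c}_1'\boldsymbol{\alpha}_i + \mathbf{c}_2'\boldsymbol{\beta}$, obtained by a specific choice of $\mathbf{c}_1$ and $\mathbf{c}_2$.
• This predictor is of interest in actuarial science, where it is known as the credibility estimator.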

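BLUP Residuals
• Here, $w = \varepsilon_{it}$. Because $\mathrm{E}\,w = 0$, it follows that $\boldsymbol{\lambda} = \mathbf{0}$.
• Straightforward calculations show that
  – $\mathrm{Cov}(w, \mathbf{y}_j) = \sigma^2\mathbf{1}_{it}'$ for $j = i$ and
  – $\mathrm{Cov}(w, \mathbf{y}_j) = \mathbf{0}$ for $j \neq i$.
  – Here, the symbol $\mathbf{1}_{it}$ denotes a $T_i \times 1$ vector that has a “one” in the $t$th position and is zero otherwise.
• Thus, $e_{it,BLUP} = \sigma^2\mathbf{1}_{it}'\mathbf{V}_i^{-1}(\mathbf{y}_i - \mathbf{X}_i\mathbf{b}_{GLS})$.
• This can also be expressed as $e_{it,BLUP} = y_{it} - (\mathbf{z}_{it}'\mathbf{a}_{i,BLUP} + \mathbf{x}_{it}'\mathbf{b}_{GLS})$.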

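Predicting future observations
• Suppose that the $i$th subject is included in the data set; predict
  – $w = y_{i,T_i+L} = \mathbf{z}_{i,T_i+L}'\boldsymbol{\alpha}_i + \mathbf{x}_{i,T_i+L}'\boldsymbol{\beta} + \varepsilon_{i,T_i+L}$ for $L$ lead time units in the future.
• We will assume that $\mathbf{z}_{i,T_i+L}$ and $\mathbf{x}_{i,T_i+L}$ are known.
• It follows that $\mathrm{E}\,w = \mathbf{x}_{i,T_i+L}'\boldsymbol{\beta}$, so that $\boldsymbol{\lambda} = \mathbf{x}_{i,T_i+L}$.
• Straightforward calculations show that $\mathrm{Cov}(w, \mathbf{y}_j) = \mathbf{z}_{i,T_i+L}'\mathbf{D}\mathbf{Z}_i' + \mathrm{Cov}(\varepsilon_{i,T_i+L}, \boldsymbol{\varepsilon}_i)$ for $j = i$, and $\mathbf{0}$ for $j \neq i$.
• Thus, the forecast of $y_{i,T_i+L}$ is
  $$\hat{y}_{i,T_i+L} = \mathbf{x}_{i,T_i+L}'\mathbf{b}_{GLS} + \mathbf{z}_{i,T_i+L}'\mathbf{a}_{i,BLUP} + \mathrm{Cov}(\varepsilon_{i,T_i+L}, \boldsymbol{\varepsilon}_i)\mathbf{R}_i^{-1}\mathbf{e}_{i,BLUP},$$
  where $\mathbf{e}_{i,BLUP}$ is the vector of BLUP residuals for the $i$th subject.
• Thus, the forecast is the estimate of the conditional mean plus the serial correlation correction factor $\mathrm{Cov}(\varepsilon_{i,T_i+L}, \boldsymbol{\varepsilon}_i)\mathbf{R}_i^{-1}\mathbf{e}_{i,BLUP}$.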

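Predicting future observations
• To illustrate, consider the special case where we have autoregressive of order 1 (AR(1)), serially correlated errors.
• Thus, we have $\mathbf{R}_i$ with $(r, s)$ element $\sigma^2\rho^{|r-s|}$, so that $\mathrm{Cov}(\varepsilon_{i,T_i+L}, \varepsilon_{is}) = \sigma^2\rho^{T_i+L-s}$.
• After some algebra, the $L$ step forecast is
  $$\hat{y}_{i,T_i+L} = \mathbf{x}_{i,T_i+L}'\mathbf{b}_{GLS} + \mathbf{z}_{i,T_i+L}'\mathbf{a}_{i,BLUP} + \rho^L e_{iT_i,BLUP}.$$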
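The “some algebra” behind the AR(1) forecast can be checked numerically. The sketch below assumes a stationary AR(1) error with $\mathrm{Cov}(\varepsilon_{ir}, \varepsilon_{is}) = \sigma^2\rho^{|r-s|}$; the parameter values and the helper name `ar1_correction` are hypothetical. It verifies that the covariance between the future error and the observed errors equals $\rho^L$ times the last row of $\mathbf{R}_i$, which implies that the correction factor depends only on the last BLUP residual.

```python
# HYPOTHETICAL AR(1) parameters, series length, and lead time.
rho, sigma2 = 0.6, 2.0
T, L = 5, 3

# R has (r, s) element sigma2 * rho**|r - s| for r, s = 1..T.
R = [[sigma2 * rho ** abs(r - s) for s in range(1, T + 1)] for r in range(1, T + 1)]

# Covariance between the future error at time T + L and each observed error.
cov_future = [sigma2 * rho ** (T + L - s) for s in range(1, T + 1)]

# cov_future equals rho**L times the last row of R. Since R is symmetric,
# cov_future R^{-1} = rho**L * (0, ..., 0, 1): only the last residual matters.
last_row_scaled = [rho ** L * R[T - 1][s] for s in range(T)]
max_gap = max(abs(a - b) for a, b in zip(cov_future, last_row_scaled))
print("largest discrepancy:", max_gap)

def ar1_correction(e_blup, rho, L):
    """Serial correlation correction rho**L * e_{iT_i,BLUP} (illustrative helper)."""
    return rho ** L * e_blup[-1]

# Example: BLUP residuals for one subject (made-up numbers).
print(ar1_correction([0.5, -1.0, 2.0], rho=0.5, L=2))
```

Because the correction decays like $\rho^L$, forecasts further into the future rely less on the most recent residual and more on the estimated conditional mean.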

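4.5 Bayesian Inference
• With Bayesian statistical models, one views both the model parameters and the data as random variables.
  – We assume distributions for each type of random variable.
• Given the parameters $\boldsymbol{\beta}$ and $\boldsymbol{\alpha}$, the response model is $\mathbf{y} = \mathbf{Z}\boldsymbol{\alpha} + \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$.
  – Specifically, we assume that the responses $\mathbf{y}$ conditional on $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ are normally distributed and that $\mathrm{E}(\mathbf{y} \mid \boldsymbol{\alpha}, \boldsymbol{\beta}) = \mathbf{Z}\boldsymbol{\alpha} + \mathbf{X}\boldsymbol{\beta}$ and $\mathrm{Var}(\mathbf{y} \mid \boldsymbol{\alpha}, \boldsymbol{\beta}) = \mathbf{R}$.
• Assume that $\boldsymbol{\alpha}$ is distributed normally with mean $\boldsymbol{\mu}_\alpha$ and variance $\mathbf{D}$ and that $\boldsymbol{\beta}$ is distributed normally with mean $\boldsymbol{\mu}_\beta$ and variance $\boldsymbol{\Sigma}_\beta$, each independent of the other.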

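Distributions
• The joint distribution of $(\boldsymbol{\alpha}, \boldsymbol{\beta})$ is known as the prior distribution.
• To summarize, the joint distribution of $(\boldsymbol{\alpha}, \boldsymbol{\beta}, \mathbf{y})$ is
  $$\begin{pmatrix}\boldsymbol{\alpha}\\ \boldsymbol{\beta}\\ \mathbf{y}\end{pmatrix} \sim N\left(\begin{pmatrix}\boldsymbol{\mu}_\alpha\\ \boldsymbol{\mu}_\beta\\ \mathbf{Z}\boldsymbol{\mu}_\alpha + \mathbf{X}\boldsymbol{\mu}_\beta\end{pmatrix}, \begin{pmatrix}\mathbf{D} & \mathbf{0} & \mathbf{D}\mathbf{Z}'\\ \mathbf{0} & \boldsymbol{\Sigma}_\beta & \boldsymbol{\Sigma}_\beta\mathbf{X}'\\ \mathbf{Z}\mathbf{D} & \mathbf{X}\boldsymbol{\Sigma}_\beta & \mathbf{V} + \mathbf{X}\boldsymbol{\Sigma}_\beta\mathbf{X}'\end{pmatrix}\right),$$
  where $\mathbf{V} = \mathbf{R} + \mathbf{Z}\mathbf{D}\mathbf{Z}'$.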

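Posterior Distribution
• The distribution of parameters given the data is known as the posterior distribution.
• The posterior distribution of $(\boldsymbol{\alpha}, \boldsymbol{\beta})$ given $\mathbf{y}$ is normal.
• The conditional moments are
  $$\mathrm{E}\left(\begin{pmatrix}\boldsymbol{\alpha}\\ \boldsymbol{\beta}\end{pmatrix} \Big|\, \mathbf{y}\right) = \begin{pmatrix}\boldsymbol{\mu}_\alpha + \mathbf{D}\mathbf{Z}'(\mathbf{V} + \mathbf{X}\boldsymbol{\Sigma}_\beta\mathbf{X}')^{-1}(\mathbf{y} - \mathbf{Z}\boldsymbol{\mu}_\alpha - \mathbf{X}\boldsymbol{\mu}_\beta)\\ \boldsymbol{\mu}_\beta + \boldsymbol{\Sigma}_\beta\mathbf{X}'(\mathbf{V} + \mathbf{X}\boldsymbol{\Sigma}_\beta\mathbf{X}')^{-1}(\mathbf{y} - \mathbf{Z}\boldsymbol{\mu}_\alpha - \mathbf{X}\boldsymbol{\mu}_\beta)\end{pmatrix}$$
  and
  $$\mathrm{Var}\left(\begin{pmatrix}\boldsymbol{\alpha}\\ \boldsymbol{\beta}\end{pmatrix} \Big|\, \mathbf{y}\right) = \begin{pmatrix}\mathbf{D} & \mathbf{0}\\ \mathbf{0} & \boldsymbol{\Sigma}_\beta\end{pmatrix} - \begin{pmatrix}\mathbf{D}\mathbf{Z}'\\ \boldsymbol{\Sigma}_\beta\mathbf{X}'\end{pmatrix}(\mathbf{V} + \mathbf{X}\boldsymbol{\Sigma}_\beta\mathbf{X}')^{-1}\begin{pmatrix}\mathbf{Z}\mathbf{D} & \mathbf{X}\boldsymbol{\Sigma}_\beta\end{pmatrix}.$$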

Relation with BLUPs
• In longitudinal data applications, one typically has more information about the global parameters $\boldsymbol{\beta}$ than subject-specific parameters $\boldsymbol{\alpha}$.
• Consider first the case $\boldsymbol{\Sigma}_\beta = \mathbf{0}$, so that $\boldsymbol{\beta} = \boldsymbol{\mu}_\beta$ with probability one.
  – Intuitively, this means that $\boldsymbol{\beta}$ is precisely known, generally from collateral information.
  – Assuming that $\boldsymbol{\mu}_\alpha = \mathbf{0}$, it is easy to check that the best linear unbiased estimator (BLUE) of $\mathrm{E}(\boldsymbol{\alpha} \mid \mathbf{y})$ is $\mathbf{a}_{BLUP} = \mathbf{D}\mathbf{Z}'\mathbf{V}^{-1}(\mathbf{y} - \mathbf{X}\mathbf{b}_{GLS})$.
  – Recall from equation (4.11) that $\mathbf{a}_{BLUP}$ is also the best linear unbiased predictor in the frequentist (non-Bayesian) model framework.

Relation with BLUPs
• Consider second the case where $\boldsymbol{\Sigma}_\beta^{-1} = \mathbf{0}$.
  – In this case, prior information about the parameter $\boldsymbol{\beta}$ is vague; this is known as using a diffuse prior.
  – Assuming $\boldsymbol{\mu}_\alpha = \mathbf{0}$, one can show that $\mathrm{E}(\boldsymbol{\alpha} \mid \mathbf{y}) = \mathbf{a}_{BLUP}$.
• It is interesting that in both extreme cases, we arrive at the statistic $\mathbf{a}_{BLUP}$ as a predictor of $\boldsymbol{\alpha}$.
  – This analysis assumes $\mathbf{D}$ and $\mathbf{R}$ are matrices of fixed parameters.
  – It is also possible to assume distributions for these parameters; typically, independent Wishart distributions are used for $\mathbf{D}^{-1}$ and $\mathbf{R}^{-1}$ as these are conjugate priors.
  – The general strategy of substituting point estimates for certain parameters in a posterior distribution is called empirical Bayes estimation.
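The diffuse-prior case can be illustrated numerically. The sketch below uses a one-way model with $\mathbf{X}$ a column of ones and a scalar prior variance for $\beta$, and assumes the standard normal conditioning result $\mathrm{E}(\boldsymbol{\alpha} \mid \mathbf{y}) = \mathbf{D}\mathbf{Z}'(\mathbf{V} + \mathbf{X}\boldsymbol{\Sigma}_\beta\mathbf{X}')^{-1}(\mathbf{y} - \mathbf{X}\boldsymbol{\mu}_\beta)$ when $\boldsymbol{\mu}_\alpha = \mathbf{0}$. The data reuse the run-time example; the variance components, the prior mean, and all function names are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[k]] for k, row in enumerate(A)]   # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

data = [[14.0, 12.0, 10.0, 12.0], [9.0, 16.0, 15.0, 12.0], [8.0, 10.0, 7.0, 7.0]]
sigma2, sigma2_alpha = 4.0, 6.0   # HYPOTHETICAL R = sigma2 * I and D = sigma2_alpha
mu_beta = 10.0                    # HYPOTHETICAL prior mean of beta; mu_alpha = 0

y = [v for yi in data for v in yi]
subject = [i for i, yi in enumerate(data) for _ in yi]
N, n = len(y), len(data)
V = [[(sigma2_alpha if subject[r] == subject[s] else 0.0)
      + (sigma2 if r == s else 0.0) for s in range(N)] for r in range(N)]

def subject_sums(u):
    """Apply D Z' to a stacked vector: sigma2_alpha times the sum within each subject."""
    return [sigma2_alpha * sum(u[r] for r in range(N) if subject[r] == i)
            for i in range(n)]

def posterior_mean_alpha(sigma2_beta):
    """E(alpha | y) = D Z' (V + X Sigma_beta X')^{-1} (y - X mu_beta), with X = 1."""
    A = [[V[r][s] + sigma2_beta for s in range(N)] for r in range(N)]
    return subject_sums(solve(A, [v - mu_beta for v in y]))

# Frequentist BLUP: a_BLUP = D Z' V^{-1} (y - X b_GLS).
b_gls = sum(solve(V, y)) / sum(solve(V, [1.0] * N))
a_blup = subject_sums(solve(V, [v - b_gls for v in y]))

diffuse = posterior_mean_alpha(1e6)   # very large prior variance for beta
for i in range(n):
    print(i + 1, round(diffuse[i], 5), round(a_blup[i], 5))
```

With a very large prior variance the posterior means should be numerically indistinguishable from $\mathbf{a}_{BLUP}$, and the chosen prior mean for $\beta$ should no longer matter. Calling `posterior_mean_alpha(0.0)` gives the other extreme, where $\beta$ is treated as known and equal to `mu_beta`.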

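Example – One-way random effects ANOVA model
• Consider $y_{it} = \beta + \alpha_i + \varepsilon_{it}$ with balanced data ($T_i = T$), $\mu_\alpha = 0$, and prior $\beta \sim N(\mu_\beta, \sigma_\beta^2)$.
• The posterior means turn out to be
  $$\hat{\beta} = \mathrm{E}(\beta \mid \mathbf{y}) = \zeta_\beta\bar{y} + (1 - \zeta_\beta)\mu_\beta \quad\text{and}\quad \hat{\alpha}_i = \mathrm{E}(\alpha_i \mid \mathbf{y}) = \zeta\left[(\bar{y}_i - \mu_\beta) - \zeta_\beta(\bar{y} - \mu_\beta)\right],$$
  where $\zeta = T\sigma_\alpha^2/(T\sigma_\alpha^2 + \sigma^2)$ and $\zeta_\beta = n\sigma_\beta^2/(n\sigma_\beta^2 + \sigma_\alpha^2 + \sigma^2/T)$.
• Note that $\zeta_\beta$ measures the precision of knowledge about $\beta$. Specifically, we see that $\zeta_\beta$ approaches one as $\sigma_\beta^2 \to \infty$, and approaches zero as $\sigma_\beta^2 \to 0$.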

4.6 Wisconsin Lottery Sales
• T = 40 weeks of sales from n = 50 zip codes

Lottery Sales Data Analysis
• Cross-sectional analysis shows that population size heavily influences sales, with Kenosha as an outlier.
• Multiple time series plots
  – show the effect of jackpots that is common to all postal codes,
  – show the heterogeneity among postal codes (reaffirmed by a pooling test), and
  – show the heteroscedasticity that is accommodated through a logarithmic transformation.

Lottery Sales Model Selection
• In-sample results show that
  – The one-way error components model dominates pooled cross-sectional models.
  – An AR(1) error specification significantly improves the fit.
  – The best model is probably the two-way error components model, with an AR(1) error specification (not yet documented).
• Out-of-sample analysis suggests that
  – Logarithmic sales is the preferred choice of response; it outperforms sales and percentage change.

4.7 What is Credibility?
• Hickman’s (1975) Analogy
  – In politics, leaders begin with a reservoir of credibility which decreases as executive experience is compiled.
  – Insurance behaves in a reverse fashion!
  – Here, credibility increases as experience increases.

Credibility Theory
• Credibility is a technique for predicting future expected claims for a risk class, given past claims of that and related risk classes.
• Importance
  – Credibility is widely used for pricing property and casualty, workers’ compensation, and health care coverages.
  – According to Rodermund (1989), “the concept of credibility has been the casualty actuaries’ most important and enduring contribution to casualty actuarial science.”

History
• Mowbray (1914 - PCAS)
  – Asked the question, “how extensive is an exposure necessary to give a dependable pure premium?”
  – This approach is now known as the “limited fluctuation” or “American” credibility.
• Question 1 – do we have enough exposure to give full weight to the risk class under consideration?
• Question 2 – if not, how can we combine information from this and related risk classes?

More History
• Whitney (1918 - PCAS)
  – Introduced the idea of using a weighted average of average claims of (1) a given risk class and (2) all risk classes.
  – The weight is known as the credibility factor.
  – It is of the form New Premium = Z × Claims Experience + (1 – Z) × Old Premium.

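Example - Balanced Bühlmann
• Consider the model $y_{it} = \mu + \alpha_i + \varepsilon_{it}$ with $T_i = T$.
• The credibility factor is $Z = \zeta = T/(T + \sigma^2/\sigma_\alpha^2)$.
• The traditional credibility estimator is $\bar{y}_{i,BLUP} = \zeta\bar{y}_i + (1 - \zeta)\bar{y}$.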
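To connect the credibility factor with Hickman’s remark that credibility increases with experience, the sketch below evaluates $Z = T/(T + \sigma^2/\sigma_\alpha^2)$ for several values of $T$ and applies Whitney’s weighted average. The variance components are hypothetical values chosen for illustration, and both function names are made up for this sketch.

```python
def credibility_factor(T, sigma2, sigma2_alpha):
    """Balanced-design credibility factor Z = T / (T + sigma2 / sigma2_alpha)."""
    return T / (T + sigma2 / sigma2_alpha)

def credibility_premium(Z, class_mean, overall_mean):
    """Whitney's form: Z * (class experience) + (1 - Z) * (overall experience)."""
    return Z * class_mean + (1 - Z) * overall_mean

# HYPOTHETICAL variance components (within-class and between-class).
sigma2, sigma2_alpha = 44 / 9, 52 / 9

# A risk class averaging 8 when the overall average is 11.
for T in (1, 4, 16, 64):
    Z = credibility_factor(T, sigma2, sigma2_alpha)
    print(T, round(Z, 3), round(credibility_premium(Z, 8.0, 11.0), 3))
```

As $T$ grows, $Z$ moves toward one and the premium moves from the overall average toward the class’s own experience, which is the sense in which more exposure earns more credibility.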

Example
Hypothetical Claims for Three Towns
  Town   Claims           Average Claim
  1      14, 12, 10, 12   $\bar{y}_1 = 12$
  2      9, 16, 15, 12    $\bar{y}_2 = 13$
  3      8, 10, 7, 7      $\bar{y}_3 = 8$
• Are there real differences among towns?
• Mowbray - does Town 3 have enough data to support its own estimator of pure premiums?
• Whitney - how can I use the information in Towns 1 and 2 to help determine my rate for Town 3?

Response to Whitney
• Known as the “shrinkage” effect. The overall average is 11.
  – Town 3: 8 → 8.525
  – Town 1: 12 → 11.825
  – Town 2: 13 → 12.650
• Comparison of Subject-Specific Means to Credibility Estimators.

Why study credibility theory?
• Long history of applications
  – “a business necessity”
  – More recently, many theoretical advances with fewer innovative applications
• Credibility techniques required in legal statutes and standards of practice
  – Standard of Practice 25 by the Actuarial Standards Board of the American Academy of Actuaries
  – Wisconsin statutes on credibility insurance and disability income
• Advanced techniques are critical for keeping up with competition (health insurance – health economists)
• Innovative techniques enhance the “credibility” of the profession