Lecture 2: Parameter Estimation and Evaluation of Support
Parameter Estimation
"The problem of estimation is of more central importance (than hypothesis testing)... for in almost all situations we know that the effect whose significance we are measuring is perfectly real, however small; what is at issue is its magnitude." (Edwards, 1992, pg. 2)
"An insignificant result, far from telling us that the effect is nonexistent, merely warns us that the sample was not large enough to reveal it." (Edwards, 1992, pg. 2)
Parameter Estimation
- Finding Maximum Likelihood Estimates (MLEs)
  - Local optimization (optim)
    » Gradient methods
    » Simplex (Nelder-Mead)
  - Global optimization
    » Simulated annealing (anneal)
    » Genetic algorithms (rgenoud)
- Evaluating the strength of evidence ("support") for different parameter estimates
  - Support intervals
    » Asymptotic support intervals
    » Simultaneous support intervals
  - The shape of likelihood surfaces around MLEs
Parameter estimation: finding peaks on likelihood "surfaces"
The variation in likelihood across different sets of parameter values defines a likelihood "surface". The goal of parameter estimation is to find the peak of that surface (optimization).
Local vs. Global Optimization
- "Fast" local optimization methods
  - Large family of methods, widely used for nonlinear regression in commercial software packages
- "Brute force" global optimization methods
  - Grid search
  - Genetic algorithms
  - Simulated annealing
[Figure: a likelihood surface with a local optimum and the global optimum labeled]
Local Optimization – Gradient Methods
- Derivative-based (Newton-Raphson) methods
- General approach: vary the parameter estimate systematically and search for a zero slope in the first derivative of the likelihood function (using numerical methods to estimate the derivative, and checking the second derivative to make sure it is a maximum, not a minimum).
[Figure: likelihood surface]
Local Optimization – No Gradient
- The simplex (Nelder-Mead) method
  - Much simpler to program
  - Does not require calculation or estimation of a derivative
  - No general theoretical proof that it works (but lots of happy practitioners…)
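As a rough illustration of both kinds of local optimization, here is a minimal R sketch (the data, model, and parameter names are made up for the example): optim() finds the MLEs of a simple normal model with the simplex (Nelder-Mead) method and with a gradient-based quasi-Newton (BFGS) method.

# Minimal sketch: local optimization of a normal log-likelihood with optim()
# optim() minimizes by default, so we supply the negative log-likelihood
set.seed(1)
y <- rnorm(100, mean = 5, sd = 2)                 # hypothetical data

negll <- function(p) {
  mu <- p[1]
  sd <- exp(p[2])                                 # optimize log(sd) so sd stays positive
  -sum(dnorm(y, mean = mu, sd = sd, log = TRUE))
}

start <- c(mu = 0, log_sd = 0)

fit_nm   <- optim(start, negll, method = "Nelder-Mead")  # simplex: no derivatives needed
fit_bfgs <- optim(start, negll, method = "BFGS")         # gradient-based (numerical derivatives)

fit_nm$par    # MLEs from the simplex search
fit_bfgs$par  # MLEs from the gradient-based search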
Global Optimization
"Virtually nothing is known about finding global extrema in general."
"There are tantalizing hints that so-called 'annealing methods' may lead to important progress on global (optimization)..."
– Press et al. (1986), Numerical Recipes
Global Optimization – Grid Searches
- Simplest form of optimization (and rarely used in practice)
  - Systematically search parameter space at a grid of points
- Can be useful for visualization of the broad features of a likelihood surface
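A grid search is easy to sketch in R; this toy example (made-up data and grid limits) evaluates a two-parameter normal log-likelihood at every grid point and picks the best one, which is fine for visualizing the surface but crude as an estimator.

# Toy grid search over a 2-parameter normal log-likelihood
set.seed(1)
y <- rnorm(100, mean = 5, sd = 2)                 # hypothetical data

grid <- expand.grid(mu = seq(3, 7, length.out = 101),
                    sd = seq(0.5, 4, length.out = 101))
grid$loglik <- apply(grid, 1, function(p)
  sum(dnorm(y, mean = p["mu"], sd = p["sd"], log = TRUE)))

grid[which.max(grid$loglik), ]   # best grid point (approximate MLEs)
# contour() or image() of the loglik values is useful for viewing the surface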
Global Optimization – Genetic Algorithms
- Based on a fairly literal analogy with evolution
  - Start with a reasonably large "population" of parameter sets
  - Calculate the "fitness" (likelihood) of each individual set of parameters
  - Create the next generation of parameter sets based on the fitness of the "parents", and various rules for recombination of subsets of parameters (genes)
  - Let the population evolve until fitness reaches a maximum asymptote
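The recipe above can be sketched as a very stripped-down genetic algorithm in base R (a toy illustration with made-up data, not the rgenoud implementation): "fitness" is the log-likelihood, and "recombination" is simple parent averaging plus mutation.

# Toy genetic algorithm maximizing a 2-parameter normal log-likelihood
set.seed(1)
y <- rnorm(100, mean = 5, sd = 2)                           # hypothetical data
fitness <- function(p) sum(dnorm(y, mean = p[1], sd = exp(p[2]), log = TRUE))

pop_size <- 50
pop <- cbind(mu = runif(pop_size, 0, 10),                   # random starting population
             log_sd = runif(pop_size, -2, 2))

for (gen in 1:200) {
  fit <- apply(pop, 1, fitness)                             # "fitness" of each parameter set
  parents <- pop[order(fit, decreasing = TRUE)[1:(pop_size/2)], ]   # keep the fittest half
  kids <- t(replicate(pop_size/2, {                         # recombine two random parents...
    pair <- parents[sample(nrow(parents), 2), ]
    colMeans(pair) + rnorm(2, sd = 0.1)                     # ...and mutate
  }))
  pop <- rbind(parents, kids)
}

best <- pop[which.max(apply(pop, 1, fitness)), ]
c(mu = best[["mu"]], sd = exp(best[["log_sd"]]))            # should be close to the MLEs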
Global Optimization – Simulated Annealing
- Analogy with the physical process of annealing:
  - Start the process at a high "temperature"
  - Gradually reduce the temperature according to an annealing schedule
- Always accept uphill moves (i.e. an increase in likelihood)
- Accept downhill moves according to the Metropolis algorithm: p = exp(-Δlh / t), where p = probability of accepting the downhill move, Δlh = magnitude of the change in likelihood, and t = temperature
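In code, the acceptance rule is just a few lines; here is an R sketch of the Metropolis step described above (variable names are made up; ll_new and ll_old are the log-likelihoods of the proposed and current parameter sets).

# Metropolis acceptance rule for simulated annealing (maximizing a log-likelihood)
accept_move <- function(ll_new, ll_old, t) {
  if (ll_new >= ll_old) return(TRUE)      # always accept uphill (or equal) moves
  p <- exp(-(ll_old - ll_new) / t)        # p = exp(-Δlh / t)
  runif(1) < p                            # accept the downhill move with probability p
}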
Effect of temperature (t)
Simulated Annealing in practice...
A version with automatic adjustment of the search range (step size)...
[Figure: search range (step size) around the current value, bounded by the lower and upper bounds]
REFERENCES:
Goffe, W. L., G. D. Ferrier, and J. Rogers. 1994. Global optimization of statistical functions with simulated annealing. Journal of Econometrics 60:65-99.
Corana et al. 1987. Minimizing multimodal functions of continuous variables with the simulated annealing algorithm. ACM Transactions on Mathematical Software 13:262-280.
Constraints – setting limits for the search...
- Biological limits
  - Values that make no sense biologically (be careful...)
- Algebraic limits
  - Values for which the model is undefined (i.e. dividing by zero...)
Bottom line: global optimization methods let you cast your net widely, at the cost of computer time...
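One simple way to impose such limits in R (a sketch with made-up data and bounds, not the lecture's own code) is to hand lower and upper bounds to a box-constrained optimizer such as optim's "L-BFGS-B" method.

# Sketch: enforcing biological/algebraic limits with a box-constrained optimizer
set.seed(1)
y <- rnorm(100, mean = 5, sd = 2)                                    # hypothetical data
negll <- function(p) -sum(dnorm(y, mean = p[1], sd = p[2], log = TRUE))

fit <- optim(par = c(mu = 1, sd = 1), fn = negll, method = "L-BFGS-B",
             lower = c(-100, 1e-6),   # sd must be > 0 or the model is undefined
             upper = c( 100, 50))     # wide "biological" limits, made up for the example
fit$par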
Simulated Annealing – Initialization
- Set the annealing schedule (typical values in parentheses)
  » Initial temperature (t) (3.0)
  » Rate of reduction in temperature (rt) (0.95)
  » Interval between drops in temperature (nt) (100)
  » Interval between changes in range (ns) (20)
- Set the parameter values
  » Initial values (x)
  » Upper and lower bounds (lb, ub)
  » Initial range (vm)
Simulated Annealing – Step 1: pick a new set of parameter values (by varying just one parameter)

begin {a single iteration}
  {copy the current parameter array (x) to a temporary holder (xp) for this iteration}
  xp := x;
  {choose a new value for the parameter in use (puse); vm is the search range}
  xp[puse] := x[puse] + ((random*2 - 1)*vm[puse]);
  {check if the new value is out of bounds (lb = lower bound, ub = upper bound)}
  if xp[puse] < lb[puse] then
    xp[puse] := x[puse] - (random * (x[puse]-lb[puse]));
  if xp[puse] > ub[puse] then
    xp[puse] := x[puse] + (random * (ub[puse]-x[puse]));
Simulated Annealing – Step 2: accept the step if it leads uphill

{call the likelihood function with the new set of parameter values}
likeli(xp, fp);   {fp = new likelihood}
{accept the new values if likelihood increases or at least stays the same}
if (fp >= f) then
begin
  x := xp;
  f := fp;
  nacp[puse] := nacp[puse] + 1;
  {if this is a new maximum, update the maximum likelihood}
  if (fp > fopt) then
  begin
    xopt := xp;
    fopt := fp;
    opteval := eval;
    BestFit;   {update the display of the current maximum}
  end;
end
Simulated Annealing – Step 3: use the Metropolis algorithm to decide whether to accept a downhill step

else {use the Metropolis criterion to determine whether to accept a downhill move}
begin
  try
    {fp < f, so the line below is equivalent to exp(-1.0*(abs(f-fp)/t))}
    p := exp((fp-f)/t);   {t = current temperature}
  except
    on EUnderflow do p := 0;
  end;
  pp := random;
  if pp < p then
  begin
    x := xp;
    f := fp;
    nacp[puse] := nacp[puse] + 1;
  end;
Simulated Annealing – Step 4: periodically adjust the range (vm) within which new steps are chosen

{after nused*ns cycles, adjust vm so that about half of all evaluations are accepted}
{ns is typically ~20; this adjustment is strictly ad hoc}
if eval mod (nused*ns) = 0 then
begin
  for i := 0 to npmax do
    if xvary[i] then
    begin
      ratio := nacp[i]/ns;
      {c controls the adjustment of vm (range) - references suggest setting it at 2.0}
      if (ratio > 0.6) then
        vm[i] := vm[i]*(1.0 + c[i]*((ratio - 0.6)/0.4))
      else if ratio < 0.4 then
        vm[i] := vm[i]/(1.0 + c[i]*((0.4 - ratio)/0.4));
      if vm[i] > (ub[i]-lb[i]) then
        vm[i] := ub[i] - lb[i];
    end;
  {reset the acceptance counters}
  for i := 1 to npmax do
    nacp[i] := 0;
end;
Effect of C on Adjusting Range...
Simulated Annealing Code – Final Step: reduce the "temperature" according to the annealing schedule

{after nused*ns*nt cycles, reduce the temperature t}
if eval mod (nused*ns*nt) = 0 then
begin
  t := rt * t;
  {store the current maximum likelihood in the history list}
  lhist[eval div (nused*ns*nt)].iter := eval;
  lhist[eval div (nused*ns*nt)].lhood := fopt;
end;

I typically set nt = 100 (a very slow annealing). rt is the fractional reduction in temperature at each drop in temperature. NOTE: Goffe et al. restart the search at the previous MLE estimates each time the temperature drops... (I don't.)
How many iterations?
- Red maple leaf litterfall (6 parameters): 500,000 is way more than necessary!
- Logistic regression of windthrow susceptibility (188 parameters): 5 million is not enough!
- What would constitute convergence?
Optimization – Summary
- No hard and fast rules for any optimization – be willing to explore alternate options.
- Be wary of initial values used in local optimization when the model is at all complicated.
- How about a hybrid approach? Start with simulated annealing, then switch to a local optimization…
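A hybrid run is easy to set up in R (a sketch using the same made-up normal model as the earlier examples): a coarse simulated-annealing pass with optim's "SANN" method, followed by a local polish with Nelder-Mead starting from the annealing solution.

# Hybrid strategy: rough global search first, then local refinement
set.seed(1)
y <- rnorm(100, mean = 5, sd = 2)                                        # hypothetical data
negll <- function(p) -sum(dnorm(y, mean = p[1], sd = exp(p[2]), log = TRUE))

rough <- optim(c(mu = 0, log_sd = 0), negll, method = "SANN",
               control = list(maxit = 20000, temp = 10))   # crude annealing pass
fine  <- optim(rough$par, negll, method = "Nelder-Mead")   # polish from the SANN result
fine$par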
Evaluating the strength of evidence for the MLE
Now that you have an MLE, how should you evaluate it? (Hint: think about the shape of the likelihood function, not just the MLE.)
Strength of evidence for particular parameter estimates – "Support"
- Log-likelihood = "support" (Edwards 1992)
- Likelihood provides an objective measure of the strength of evidence for different parameter estimates.
Fisher's "Score" and "Information"
- "Score" (a function) = first derivative (slope) of the log-likelihood function
  - So S(θ) = 0 at the maximum likelihood estimate of θ
- "Information" (a number) = -1 * second derivative (curvature) of the log-likelihood function, evaluated at the MLE
  - A measure of how steeply the likelihood drops off as you move away from the MLE
  - The inverse of the information approximates the variance of the parameter estimate: more information means a more precise estimate
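For a one-parameter example, the observed information can be approximated with a finite-difference second derivative at the MLE; this R sketch (made-up data, normal mean with the sd treated as known) shows that 1/sqrt(information) recovers the familiar standard error of a mean.

# Observed information for the mean of a normal with known sd
set.seed(1)
y  <- rnorm(100, mean = 5, sd = 2)                       # hypothetical data
ll <- function(mu) sum(dnorm(y, mean = mu, sd = 2, log = TRUE))

mu_hat <- mean(y)                                        # MLE of the mean when sd is known
h      <- 1e-4                                           # finite-difference step
info   <- -(ll(mu_hat + h) - 2*ll(mu_hat) + ll(mu_hat - h)) / h^2

1/sqrt(info)    # asymptotic SE of mu_hat; compare with sd/sqrt(n) = 2/sqrt(100) = 0.2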
Profile Likelihood
- Evaluate support (information) for a range of values of a given parameter by treating all other parameters as "nuisance" parameters and holding them at their MLEs…
[Figure: likelihood surface with axes Parameter 1 and Parameter 2]
Asymptotic vs. Simultaneous M-Unit Support Limits
- Asymptotic support limits (based on the profile likelihood):
  - Hold all other parameters at their MLE values, and systematically vary the remaining parameter until the log-likelihood declines by a chosen amount (m)...
- What should "m" be? (2 is a good number, and is roughly analogous to a 95% CI.)
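A rough R sketch of this procedure (made-up data; the mean of a normal is the parameter of interest, and the sd is held at its MLE as described above) scans a grid of values and reads off the 2-unit support limits.

# 2-unit asymptotic support limits for the mean of a normal
set.seed(1)
y <- rnorm(100, mean = 5, sd = 2)                        # hypothetical data

mu_hat <- mean(y)
sd_hat <- sqrt(mean((y - mu_hat)^2))                     # MLE of sd (divides by n)
ll <- function(mu) sum(dnorm(y, mean = mu, sd = sd_hat, log = TRUE))

mu_grid <- seq(mu_hat - 1, mu_hat + 1, length.out = 401)
support <- sapply(mu_grid, ll)

keep <- support >= max(support) - 2                      # within 2 units of the maximum support
range(mu_grid[keep])                                     # approximate 2-unit support limits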
Asymptotic vs. Simultaneous M-Unit Support Limits
- Simultaneous (resampling method): draw a very large number of random sets of parameters and calculate the log-likelihood of each. The m-unit simultaneous support limits for parameter xi are the highest and lowest values of xi among all parameter sets within m units of support of the maximum...
- In practice, this can require an enormous number of iterations if there are more than a few parameters.
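A crude R sketch of the resampling idea (same made-up normal model; in a real problem far more draws would be needed): draw many random parameter sets, keep those within 2 units of the best log-likelihood found, and take the range of each parameter among the kept sets.

# Crude simultaneous 2-unit support limits by random draws
set.seed(1)
y  <- rnorm(100, mean = 5, sd = 2)                       # hypothetical data
ll <- function(mu, s) sum(dnorm(y, mean = mu, sd = s, log = TRUE))

n_draws <- 50000
draws <- data.frame(mu = runif(n_draws, 3, 7),
                    sd = runif(n_draws, 0.5, 5))
draws$loglik <- mapply(ll, draws$mu, draws$sd)

keep <- draws$loglik >= max(draws$loglik) - 2            # within 2 units of the best draw
range(draws$mu[keep])                                    # simultaneous 2-unit limits for mu
range(draws$sd[keep])                                    # ...and for sd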
Asymptotic vs. Simultaneous Support Limits
[Figure: a hypothetical likelihood surface for 2 parameters (Parameter 1 vs. Parameter 2), showing the 2-unit drop in support, the asymptotic 2-unit support limits for P1, and the wider simultaneous 2-unit support limits for P1]
Other measures of strength of evidence for different parameter estimates
- Edwards (1992; Chapter 5): various measures of the "shape" of the likelihood surface in the vicinity of the MLE. How pointed is the peak?
Bootstrap methods
- Bootstrap methods can be used to estimate the variances of parameter estimates. In simple terms:
  » Generate many replicates of the dataset by sampling with replacement (bootstrap samples)
  » Estimate the parameters for each replicate dataset
  » Use the variance of those parameter estimates as a bootstrap estimate of the variance
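A short R sketch of the nonparametric bootstrap for the same made-up normal model: resample the data with replacement, refit the MLEs each time, and use the spread of the estimates.

# Bootstrap variances of MLEs for a simple normal model
set.seed(1)
y <- rnorm(100, mean = 5, sd = 2)                        # hypothetical data
negll <- function(p, dat) -sum(dnorm(dat, mean = p[1], sd = exp(p[2]), log = TRUE))

boot_est <- replicate(1000, {
  y_star <- sample(y, replace = TRUE)                        # resample with replacement
  fit <- optim(c(mu = 0, log_sd = 0), negll, dat = y_star)   # re-estimate the parameters
  c(mu = fit$par[["mu"]], sd = exp(fit$par[["log_sd"]]))
})

apply(boot_est, 1, var)   # bootstrap variances of the parameter estimates
apply(boot_est, 1, sd)    # ...and bootstrap standard errors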
Evaluating Support for Parameter Estimates: A Frequentist Approach
- Traditional confidence intervals and standard errors of the parameter estimates can be generated from the Hessian matrix
  - Hessian = matrix of second partial derivatives of the likelihood function with respect to the parameters, evaluated at the maximum likelihood estimates
  - Also called the "Information Matrix" by Fisher
  - Provides a measure of the steepness of the likelihood surface in the region of the optimum
  - Can be generated in R using optim or fdHess
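One way to get the Hessian in R is to ask optim() for it directly with hessian = TRUE; the sketch below (made-up data and parameter names, not the analysis shown on the next slides) minimizes a negative log-likelihood, so the returned Hessian is already -1 times the log-likelihood Hessian and is inverted without the sign flip used below.

# Sketch: standard errors from the Hessian returned by optim()
set.seed(1)
y <- rnorm(100, mean = 5, sd = 2)                        # hypothetical data
negll <- function(p) -sum(dnorm(y, mean = p[1], sd = p[2], log = TRUE))

res <- optim(c(mu = 1, sd = 1), negll, method = "L-BFGS-B",
             lower = c(-Inf, 1e-6), hessian = TRUE)

vcov_mat <- solve(res$hessian)   # variance-covariance matrix of the estimates
sqrt(diag(vcov_mat))             # approximate standard errors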
Example from R
When maximizing a log-likelihood, the Hessian matrix (the matrix of second partial derivatives of the log-likelihood, evaluated at the maximum likelihood estimates) is -1 times a numerical approximation of Fisher's information matrix. Either way, it measures how steeply the likelihood surface drops off as you move away from the MLE.

> res$hessian
            a          b        sd
a    -150.182  -2758.360    -0.202
b   -2758.360 -67984.416    -5.926
sd     -0.201     -5.925  -299.422

(sample output from an analysis that estimates two parameters and a variance term)
More from R
Now invert the negative of the Hessian matrix to get the variance-covariance matrix of the parameter estimates:

> solve(-1*res$hessian)
               a             b            sd
a   2.613229e-02 -1.060277e-03  3.370998e-06
b  -1.060277e-03  5.772835e-05 -4.278866e-07
sd  3.370998e-06 -4.278866e-07  3.339775e-03

The square roots of the diagonal of the inverted negative Hessian are the standard errors*:

> sqrt(diag(solve(-1*res$hessian)))
       a        b       sd
  0.1616 0.007597  0.05779

(*and the MLE ± 1.96 * S.E. is an approximate 95% C.I.)