Bayes Factors and inference
Greg Francis
PSY 626: Bayesian Statistics for Psychological Science
Fall 2020, Purdue University
Bayes Factor
- The ratio of the likelihood of the data under one model to its likelihood under another (e.g., the null compared to the alternative)
- Nothing special about the null: it can compare any two models
- Likelihoods are averaged across the possible parameter values, weighted by the prior distribution each model specifies
What does it mean?
- Guidelines:
  BF 1-3: Anecdotal
  BF 3-10: Substantial
  BF 10-30: Strong
  BF 30-100: Very strong
  BF >100: Decisive
Evidence for the null
- BF01 > 1 implies (some) support for the null hypothesis
- Evidence for "invariances"
- This is more or less impossible for NHST
- It is a useful measure
- Consider a study in Psychological Science:
  Liu, Wang & Jiang (2016). Conscious Access to Suppressed Threatening Information Is Modulated by Working Memory
Working memory and face emotion
- Explored whether keeping a face in working memory influenced its visibility under continuous flash suppression
- To ensure subjects kept the face in memory, they were tested on its identity
Working memory and face emotion
- Different types of face emotions: fearful face, neutral face
- No significant difference in correct responses (same/different) between emotions:
  - Experiment 1: t(11) = -1.74, p = 0.110
- If we compute the JZS Bayes Factor we get:
  > ttest.tstat(t=-1.74, n1=12, simple=TRUE)
        B10
  0.9240776
- Which is anecdotal support for the null hypothesis
- You would want B10 < 1/3 for substantial support for the null
Replications
- Experiment 3: t(11) = -1.62, p = .133
- Experiment 4: t(13) = -1.37, p = .195
- Converting to JZS Bayes Factors suggests these are modest support for the null
- Experiment 3:
  > ttest.tstat(t=-1.62, n1=12, simple=TRUE)
        B10
  0.8033315
- Experiment 4:
  > ttest.tstat(t=-1.37, n1=14, simple=TRUE)
        B10
  0.5857839
The null result matters
- The authors wanted to demonstrate that faces with different emotions were equivalently represented in working memory
- But differently affected visibility during the flash suppression part of a trial
- Experiment 1:
  - Reaction times for seeing a face during continuous flash suppression were shorter for fearful faces than for neutral faces
    - Main effect of emotion: F(1, 11) = 5.06, p = 0.046
  - Reaction times were shorter when the emotion of the face during continuous flash suppression matched the emotion of the face in working memory
    - Main effect of congruency: F(1, 11) = 11.86, p = 0.005
Main effects
- We will talk about a Bayesian ANOVA later, but we can consider the t-test equivalent of these tests (for an F with 1 numerator degree of freedom, t = sqrt(F)):
- Effect of emotion:
  > ttest.tstat(t=sqrt(5.06), n1=12, simple=TRUE)
        B10
  1.769459
  - Suggests anecdotal support for the alternative hypothesis
- Effect of congruency:
  > ttest.tstat(t=sqrt(11.86), n1=12, simple=TRUE)
        B10
  9.664241
  - Suggests substantial support for the alternative hypothesis
Evidence
- It is generally harder to get convincing evidence (BF > 3 or BF > 10) than to get p < .05
- Interaction: F(1, 11) = 4.36, p = .061
- Contrasts:
  - RT for fearful faces shorter if congruent with working memory: t(11) = -3.59, p = .004
  - RT for neutral faces unaffected by congruency: t(11) = -0.45
- Bayesian interpretations of the t-tests:
  > ttest.tstat(t=-3.59, n1=12, simple=TRUE)
        B10
  11.94693
  > ttest.tstat(t=-0.45, n1=12, simple=TRUE)
        B10
  0.3136903
Substantial Evidence
- For a two-sample t-test (n1=n2=10), a BF > 3 corresponds to p < 0.022
- For a two-sample t-test (n1=n2=100), a BF > 3 corresponds to p < 0.012
- For a two-sample t-test (n1=n2=1000), a BF > 3 corresponds to p < 0.004
Strong Evidence
- For a two-sample t-test (n1=n2=10), a BF > 10 corresponds to p < 0.004
- For a two-sample t-test (n1=n2=100), a BF > 10 corresponds to p < 0.003
- For a two-sample t-test (n1=n2=1000), a BF > 10 corresponds to p < 0.001
- Of course, if you change your prior you change these values
  - (but not much)
- Setting the scale parameter r = sqrt(2) (ultra wide) gives:
  - For a two-sample t-test (n1=n2=10), a BF > 10 corresponds to p < 0.005
  - For a two-sample t-test (n1=n2=100), a BF > 10 corresponds to p < 0.0017
  - For a two-sample t-test (n1=n2=1000), a BF > 10 corresponds to p < 0.00054
Bayesian meta-analysis
- Rouder & Morey (2011) identified how to combine replication studies to produce a JZS Bayes Factor that accumulates the information across experiments
- For observed values t_1, ..., t_m from one-sample t-tests with sample sizes n_i (degrees of freedom v_i = n_i - 1), BF10 is

    BF10 = Integral[ Prod_i g(t_i; delta*sqrt(n_i), v_i) f(delta) d delta ] / Prod_i g(t_i; 0, v_i)

- f( ) is the Cauchy (or, for a one-tailed test, half-Cauchy) prior on the standardized effect size delta
- g( ; ncp, v) is the noncentral t distribution
- It looks complicated, but it is easy enough to calculate
Bayesian meta-analysis
- Consider the null results on face emotion and memorability
  - Experiment 1: t(11) = -1.74, p = 0.110
  - Experiment 3: t(11) = -1.62, p = .133
  - Experiment 4: t(13) = -1.37, p = .195
- Combined, this is substantial support for the alternative!

> tvalues <- c(-1.74, -1.62, -1.37)
> nvalues <- c(12, 12, 14)
> meta.ttestBF(t=tvalues, n1=nvalues)
Bayes factor analysis
--------------
[1] Alt., r=0.707 : 4.414733 ±0%

Against denominator:
  Null, d = 0
---
Bayes factor type: BFmetat, JZS
Equivalent statistics
- Bayes Factors are not magic; they use the very same information as other approaches to statistical inference
- Consider a variety of statistics for different inferential methods:
  - Standardized effect size (Cohen's d, Hedges' g)
  - Confidence interval for d or g
  - JZS Bayes Factor
  - Akaike Information Criterion (AIC)
  - Bayesian Information Criterion (BIC)
Equivalent statistics
- For a 2-sample t-test with known sample sizes n1 and n2, all of these statistics are mathematically equivalent to each other
- Given one statistic, you can compute all the others
  - Francis, G. (2017). Equivalent statistics and data interpretation. Behavior Research Methods, 49, 1524-1538.
  - http://psych.purdue.edu/~gfrancis/EquivalentStatistics/
- You should use the statistic that is appropriate for the inference you want to make
Equivalent statistics
- Each of these statistics is a "sufficient statistic" for the population effect size δ
- A data set provides an estimate d of the population effect size
- d is "sufficient" because knowing the whole data set provides no more information about δ than just knowing d
Equivalent statistics: d, t, p
- Any invertible transformation of a sufficient statistic is also sufficient
- For example, for a two-sample t-test,

    t = d / sqrt(1/n1 + 1/n2)

  so, given the sample sizes, d and t carry the same information
- Similarly, a t value corresponds to a unique p value
Equivalent statistics: CIs
- The variance of Cohen's d is a function of only the sample sizes and d
- This means that if you know d and the sample sizes, you can compute either limit of a confidence interval for d
- If you know either limit of a confidence interval for d, you can also recover d
- You get no more information about the data set by reporting a confidence interval for d than by reporting a p value
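This invertibility can be demonstrated numerically. Using a common large-sample approximation for the variance of d (this particular formula is an assumption here; texts give slightly different variants), the upper confidence limit is a monotone function of d, so d can be recovered from the limit alone by root finding:

```python
import math
from scipy.optimize import brentq

def var_d(d, n1, n2):
    # Common large-sample approximation to the variance of Cohen's d
    # (one of several variants in the literature)
    return (n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2))

def upper_95_limit(d, n1, n2):
    return d + 1.96 * math.sqrt(var_d(d, n1, n2))

n1 = n2 = 20
u = upper_95_limit(0.5, n1, n2)          # CI limit computed from d = 0.5
d_back = brentq(lambda d: upper_95_limit(d, n1, n2) - u, -5, 5)
print(round(d_back, 6))  # recovers 0.5
```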
Equivalent statistics: Likelihood
- Many statistics are based on likelihood
  - Essentially the "probability" of the observed data, given a specific model (not quite a probability, because a specific value of a continuous variable has probability zero, so it is a product of probability density function values)
- For a two-sample t-test, the alternative hypothesis (full model) is that a score from group s (1 or 2) is defined as

    X_is = mu_s + epsilon_is,  epsilon_is ~ N(0, sigma^2)

- With different means mu_1 and mu_2 for the two groups
- The likelihood for the full model is then

    L_F = Prod_s Prod_i (1 / sqrt(2*pi*sigma^2)) * exp( -(X_is - mu_s)^2 / (2*sigma^2) )
Equivalent statistics: Likelihood
- For a two-sample t-test, the null hypothesis (reduced model) is that a score from group s (1 or 2) is defined as

    X_is = mu + epsilon_is,  epsilon_is ~ N(0, sigma^2)

- With the same mean mu for each group
- These calculations always use estimates of the mean(s) and standard deviation that maximize the likelihood value for that model
Equivalent statistics: Likelihood
- Compare the full (alternative) model against the reduced (null) model with the log likelihood ratio

    Λ = ln(L_F / L_R)

- Because the reduced model is a special case of the full model, L_F >= L_R
- If Λ is sufficiently big, you can argue that the full model is better than the reduced model
  - Likelihood ratio test
Equivalent statistics: t, Likelihood
- No new information here
- Let n = n1 + n2
- Then

    Λ = (n/2) * ln( 1 + t^2 / (n - 2) )
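For a two-sample t-test the log likelihood ratio is Λ = (n/2) ln(1 + t²/(n-2)). This identity can be verified numerically: fit both models by maximum likelihood on simulated data and compare Λ from the fits with Λ from the t statistic (sketch assuming numpy/scipy):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
g1 = rng.normal(0.0, 1.0, size=15)
g2 = rng.normal(0.5, 1.0, size=12)
n = g1.size + g2.size

# Residual sums of squares under each model (ML estimates of the means)
rss_full = np.sum((g1 - g1.mean())**2) + np.sum((g2 - g2.mean())**2)
all_x = np.concatenate([g1, g2])
rss_reduced = np.sum((all_x - all_x.mean())**2)

# For normal models with the ML variance, ln L = -(n/2)(ln(2*pi*RSS/n) + 1),
# so the log likelihood ratio reduces to a ratio of RSS values:
lam_direct = (n / 2) * np.log(rss_reduced / rss_full)

# Same quantity computed from the pooled-variance t statistic
t, _ = stats.ttest_ind(g1, g2)
lam_from_t = (n / 2) * np.log(1 + t**2 / (n - 2))

print(np.isclose(lam_direct, lam_from_t))  # True
```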
Equivalent statistics: AIC
- As we saw earlier, just adding complexity to a model will make its claims unreplicable
  - The model ends up "explaining" random noise
  - The model will poorly predict future random samples
- A better approach is to adjust the likelihood to account for the complexity of the model
  - Models are penalized for their number of parameters k
- Akaike Information Criterion:

    AIC = 2k - 2*ln(L)

  - Smaller (more negative) values are better
Equivalent statistics: AIC
- For a two-sample t-test, we can compare the full (alternative, 3 parameters: mu_1, mu_2, sigma) model and the reduced (null, 2 parameters: mu, sigma) model:

    ΔAIC = AIC_R - AIC_F = 2Λ - 2

- When ΔAIC > 0, choose the full model
- When ΔAIC < 0, choose the null model
Equivalent statistics: AIC
- For small sample sizes, you will do better with a "corrected" formula that adds a further penalty:

    AICc = AIC + 2k(k+1) / (n - k - 1)

- So, for a two-sample t-test:

    ΔAICc = AICc_R - AICc_F = 2Λ - 2 + 12/(n-3) - 24/(n-4)

- When ΔAICc > 0, choose the full model
- When ΔAICc < 0, choose the null model
- The chosen model is expected to do the better job of predicting future data
  - This does not mean it will do a "good" job; maybe both models are bad
Equivalent statistics: AIC
- Model selection based on AIC is appropriate when you want to predict future data but do not have a lot of confidence that you have an appropriate model
- You expect the model to change with future data
  - Perhaps guided by the current model
- To me, this feels like a lot of research in experimental psychology
- The calculations are based on the very same information in a data set as the t-value, d-value, and p-value
Equivalent statistics: AIC
- Inference based on AIC is actually more lenient than the traditional criterion for p-values
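How lenient? ΔAIC = 2Λ - 2, and under the null hypothesis 2Λ is asymptotically chi-squared with 1 degree of freedom, so for large samples the AIC decision rule "ΔAIC > 0" corresponds to a p-value threshold of P(chi-square(1) > 2). A quick check (assuming scipy):

```python
from scipy.stats import chi2

# Large-sample p-value threshold implied by the AIC decision rule
# (the full model has one extra parameter, so the penalty difference is 2)
p_threshold = chi2.sf(2, df=1)
print(round(p_threshold, 3))  # ~0.157: much more lenient than 0.05
```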
Equivalent statistics: BIC
- Decisions based on AIC are not guaranteed to pick the "correct" model
- An alternative complexity correction does better in this regard
- Bayesian Information Criterion:

    BIC = k*ln(n) - 2*ln(L)

- For a two-sample t-test:

    ΔBIC = BIC_R - BIC_F = 2Λ - ln(n)
Equivalent statistics: BIC
- Inference based on BIC is much more stringent than the traditional criterion for p-values
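By the same large-sample logic as for AIC, the BIC decision rule "ΔBIC = 2Λ - ln(n) > 0" corresponds to a p-value threshold of P(chi-square(1) > ln n), which shrinks as n grows. A sketch (assuming scipy; the asymptotic approximation is rough for small n):

```python
import numpy as np
from scipy.stats import chi2

# Large-sample p-value thresholds implied by the BIC decision rule
# (choose the full model when the chi-square(1) statistic exceeds ln(n))
thresholds = {n: chi2.sf(np.log(n), df=1) for n in (200, 2000, 20000)}
for n, p in thresholds.items():
    print(n, round(p, 4))  # thresholds fall well below .05 and shrink with n
```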
Equivalent statistics: JZS BF
- AIC and BIC use the "best" (maximum likelihood) model parameters
- A fully Bayesian approach is to average the likelihood across plausible parameter values
  - Requires a prior probability density function
- Compute the ratio of average likelihoods for the full (alternative) and reduced (null) models
  - Bayes Factor
- The JZS prior is a prior for the standardized effect size
  - Its Bayes Factor is simply a function of t and the sample sizes
  - It contains no more information about the data set than a p-value
Equivalent statistics: JZS BF
- Inference based on the JZS Bayes Factor is much more stringent than the traditional criterion for p-values
Equivalent statistics: JZS BF
- Model selection based on BIC or the JZS Bayes Factor is guaranteed (as the sample size grows) to select the "true" model, if it is among those being tested
- So, if you think you understand a situation well enough to identify plausible "true" models, then the BIC or Bayes Factor process is a good choice for identifying the true one
Equivalent statistics
- I created a web site to do the conversions between statistics:
  - http://psych.purdue.edu/~gfrancis/EquivalentStatistics/
- It also computes other relevant statistics (e.g., post hoc power)
Equivalent statistics
- The various statistics are equivalent, but that does not mean you should report whichever you want
- It means you should think very carefully about your analysis:
  - Do you want to predict future data?
  - Do you think you can identify the "true" model?
  - Do you want to control the Type I error rate?
  - Do you want to estimate the effect size?
- You also need to think carefully about whether you can satisfy the requirements of the inference:
  - Can you avoid optional stopping in data collection?
  - Is your prior informative?
What should we do?
- The first step is to identify what you want to do
  - Not as easy as it seems
  - "Produce a significant result" is not an appropriate answer
- Your options are basically:
  1) Control Type I error: identify an appropriate sample size and fix it; identify the appropriate analyses and adjust the significance criterion appropriately; do not include data from any other studies (past or future)
  2) Estimate an effect size: sample until you have a precise enough measurement; have to figure out what "precise enough" means; explore/describe the data without drawing conclusions
  3) Find the "true" model: sample until the Bayes Factor provides overwhelming evidence for one model over the others; have to identify prior distributions of "belief" in those models; have to believe that the true model is among the set being considered
  4) Find the model that best predicts future data: machine learning techniques such as cross-validation; information criteria; be willing to accept that your current model is probably wrong
Equivalent statistics
- Common statistics are equivalent with regard to the information in the data set
- But no method of statistical inference is appropriate for every situation
- The choice of what to do can give radically different answers to seemingly similar questions:
  - n1=n2=250, d=0.183
  - p = 0.04
  - ΔBIC = -2.03 (evidence for the null)
  - ΔAICc = 2.16 (full model better predicts future data than the null model)
  - JZS Bayes Factor = 0.755 (weak evidence that slightly favors the null model)
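Most of these numbers follow directly from n1 = n2 = 250 and d = 0.183 via standard conversions for the two-sample t-test: t = d/sqrt(1/n1 + 1/n2), the log likelihood ratio Λ = (n/2) ln(1 + t²/(n-2)), and the AIC/BIC penalty terms. A Python check (a sketch; AICc variants exist in the literature):

```python
import numpy as np
from scipy import stats

n1 = n2 = 250
n = n1 + n2
d = 0.183

t = d / np.sqrt(1 / n1 + 1 / n2)            # effect size -> t statistic
p = 2 * stats.t.sf(abs(t), df=n - 2)        # two-sided p-value
lam = (n / 2) * np.log(1 + t**2 / (n - 2))  # log likelihood ratio

delta_aic = 2 * lam - 2                     # 1 extra parameter in full model
delta_aicc = delta_aic + 12 / (n - 3) - 24 / (n - 4)
delta_bic = 2 * lam - np.log(n)

print(round(p, 2), round(delta_aicc, 2), round(delta_bic, 2))
# 0.04 2.16 -2.03 -- matching the slide
```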
What should we do?
- Do you even need to make a decision? (choose a model, reject a null)
  - Oftentimes the decision of a hypothesis test is really just a description of the data
- When you make a decision you need to consider the context (weigh probabilities and utilities)
- For example, suppose a teacher needs to improve mean reading scores by 7 points for a class of 30 students
  - Approach A (compared to current method): mean improvement = 6, s = 5, d = 1.2
  - Approach B (compared to current method): mean improvement = 5, s = 50, d = 0.1
  - A: P(Mean > 7) = 0.14
  - B: P(Mean > 7) = 0.41
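The two probabilities follow from a normal approximation for the class mean: the standard error is s/sqrt(30), and the mean improvement implied by each approach's reported d and s is d*s (6 for A, 5 for B). A sketch assuming scipy:

```python
import math
from scipy.stats import norm

def p_mean_above(mean, s, n, cutoff):
    """Normal approximation: P(sample mean of n scores > cutoff)."""
    se = s / math.sqrt(n)
    return norm.sf(cutoff, loc=mean, scale=se)

p_a = p_mean_above(mean=6, s=5, n=30, cutoff=7)   # Approach A: d = 1.2
p_b = p_mean_above(mean=5, s=50, n=30, cutoff=7)  # Approach B: d = 0.1
print(round(p_a, 2), round(p_b, 2))  # 0.14 0.41
```

Note the reversal: A has the much larger effect size, but its small variance makes exceeding the 7-point target unlikely, while B's large variance gives a better chance of clearing it.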
Conclusions
- These differences make sense because science involves many different activities at different stages of investigation:
  - Discovery
  - Theorizing
  - Verification
  - Prediction
  - Testing
- Bayes Factors fit some (but not all) of these activities