Slide 1: PHYSTAT 05 Highlights: Statistical Problems in Particle Physics, Astrophysics and Cosmology
Müge Karagöz Ünel (Oxford University), University College London, 03/11/2006

Slide 2: Outline
• Conference information and history
• Introduction to statistics
• Selection of hot topics
• Available tools
• Astrophysics and cosmology
• Conclusions

Slide 3: PHYSTAT History

Slide 4: PHYSTAT History (figure only; no slide text)

Slide 5: Chronology of PHYSTAT
• Meetings: CERN (Jan 2000), Fermilab, Durham (March 2002), SLAC (Sept 2003), Oxford (Sept 2005).
• Issues: the early workshops concentrated on limits; later meetings covered a wider range of topics.
• Physicists: initially particle physicists only, then particles + 3 astrophysicists (SLAC), then particles + astro + cosmo (Oxford).
• Statisticians: a handful (2-3) at the early meetings, many at Oxford.

Slide 6: PHYSTAT 05 Programme
• 7 invited talks by statisticians
• 9 invited talks by physicists
• 38 contributed talks
• 8 posters
• Panel discussion
• 3 conference summaries
• ~90 participants

Slide 7: Invited Talks by Statisticians
• David Cox: keynote address: Bayesians, Frequentists & Physicists
• Steffen Lauritzen: goodness of fit
• Jerry Friedman: machine learning
• Susan Holmes: visualisation
• Peter Clifford: time series
• Mike Titterington: deconvolution
• Nancy Reid: conference summary (statistics)

Slide 8: Invited Talks by (Astro+)Physicists
• Bob Cousins: nuisance parameters for limits
• Kyle Cranmer: LHC discovery
• Alex Szalay: astrophysics + terabytes
• Jean-Luc Starck: multiscale geometry
• Jim Linnemann: statistical software for particle physics
• Bob Nichol: statistical software for astrophysics
• Stephen Johnson: historical transits of Venus
• Andrew Jaffe: conference summary (astrophysics)
• Gary Feldman: conference summary (particles)

Slide 9: Contents of the Proceedings
• Bayes/Frequentist: 5 talks
• Goodness of fit: 5
• Likelihood/parameter estimation: 6
• Nuisance parameters/limits/discovery: 10
• Machine learning: 7
• Software: 8
• Visualisation: 1
• Astrophysics: 5
• Time series: 1
• Deconvolution: 3

Slide 10: Statistics in (Astro/Particle) Physics

Slide 11: Statistics in (Particle) Physics
An experiment goes through the following stages:
• Prepare the conditions for taking data for a particle X (if theory-driven).
• Record events that might be X and reconstruct the measurables.
• Select events that could contain X by applying criteria (cuts).
• Generate histograms of variables and ask: is there any evidence for something new, or is the null hypothesis unrefuted? If there is evidence, what are the estimates for the parameters of X? (Confrontation of theory with experiment, or vice versa.)
• The answers can come via your favourite statistical technique (and depend on how you ask the question).

Slide 12: (Yet Another) Chronology (from S. Andreon's web page)
• Homo apriorius establishes the probability of an hypothesis, no matter what the data tell.
• Homo pragmaticus establishes that it is interested in the data only.
• Homo frequentistus measures the probability of the data given the hypothesis.
• Homo sapiens measures the probability of the data and of the hypothesis.
• Homo bayesianis measures the probability of the hypothesis, given the data.

Slide 13: Bayesian vs Frequentist
We need to make a statement about parameters, given data.
• Bayes 1763; frequentism 1937.
• Both analyse data (x) and make a statement about parameters (θ).
• Both use Prob(x; θ) and quote, e.g., 90% statements, but with very different interpretations:
  – Bayesian: probability (parameter, given data)
  – Frequentist: probability (data, given parameter)
"Bayesians address the question everyone is interested in, by using assumptions no-one believes."
"Frequentists use impeccable logic to deal with an issue of no interest to anyone."
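As a compact reminder (standard notation, not copied from the slide), the two kinds of 90% statement read:

```latex
% Frequentist: a statement about the (random) data or interval, for a fixed unknown theta
P\bigl(x \in [x_1, x_2] \mid \theta\bigr) = 0.90
% Bayesian: a credible statement about theta, given the observed data
P\bigl(\theta \in [\theta_1, \theta_2] \mid x_{\mathrm{obs}}\bigr) = 0.90
```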

Slide 14: Goodness of Fit (session)
• Lauritzen: invited talk on GoF
• Yabsley: GoF and sparse multi-dimensional data
• Ianni: GoF and sparse multi-dimensional data
• Raja: GoF and likelihood
• Gagunashvili: χ2 and weighting
• Pia: software toolkit for data analysis
• Block: rejecting outliers
• Bruckman: alignment
• Blobel: tracking

Slide 15: Goodness of Fit
• We would like to know whether a given distribution is of a specified type, test the validity of a postulated model, etc.
• A few GoF tests are widely used in practice:
  – χ2 test: the most widely used application is 1- or 2-D fits to data.
  – G2 (likelihood-ratio statistic) test: the general version of the χ2 test (Lauritzen's personal choice).
  – Kolmogorov-Smirnov test: robust but prone to mislead; can be used to check whether, say, two distributions (histograms) are compatible by calculating the p-value under the hypothesis that they differ.
  – Other, newer methods exist, such as Aslan & Zech's energy test.
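For concreteness, a minimal sketch (my own, not from the talks) of how the χ2 and Kolmogorov-Smirnov tests are typically invoked with scipy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Binned chi-square test: compare an observed histogram to expected counts.
observed = np.array([18, 25, 31, 22, 14])
expected = np.array([20, 24, 28, 24, 14])
chi2_stat, p_chi2 = stats.chisquare(observed, f_exp=expected)
print(f"chi2 = {chi2_stat:.2f}, p-value = {p_chi2:.3f}")

# Two-sample Kolmogorov-Smirnov test: are two unbinned samples compatible?
sample_a = rng.normal(0.0, 1.0, size=500)
sample_b = rng.normal(0.1, 1.0, size=500)
ks_stat, p_ks = stats.ks_2samp(sample_a, sample_b)
print(f"KS statistic = {ks_stat:.3f}, p-value = {p_ks:.3f}")
```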

Slide 16: An Example from ATLAS (Bruckman): Direct Least-Squares Solution to the Silicon Tracker Alignment Problem
The method minimizes one giant χ2 resulting from a simultaneous fit of all particle trajectories and alignment parameters, with the hit residuals weighted by the intrinsic measurement error plus multiple Coulomb scattering along the track. Using a linear expansion (all second-order derivatives are assumed negligible), the track fit is solved in closed form, and the alignment parameters follow from the key matrix relation shown on the slide. The resulting linear systems are large, so there are inherent computational challenges. The approach is equivalent to the Millepede approach of V. Blobel.
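As an illustration only, and not the ATLAS implementation, the core of such a linearized least-squares fit is the solution of the normal equations; a toy numpy version with invented numbers:

```python
import numpy as np

# Toy illustration: after linearizing, the residuals are modelled as
# r = J @ a + noise, where a are alignment-like parameters and J is the
# Jacobian of the residuals with respect to them.
rng = np.random.default_rng(1)
n_hits, n_params = 2000, 6
J = rng.normal(size=(n_hits, n_params))
a_true = np.array([0.05, -0.02, 0.01, 0.00, 0.03, -0.04])
sigma = 0.1                                   # intrinsic measurement error
r = J @ a_true + rng.normal(scale=sigma, size=n_hits)

# Minimizing chi2 = sum_i (r_i - (J a)_i)^2 / sigma_i^2 leads to the normal
# equations (J^T W J) a = J^T W r, with W the diagonal matrix of weights.
w = np.full(n_hits, 1.0 / sigma**2)
lhs = (J * w[:, None]).T @ J
rhs = J.T @ (w * r)
a_hat = np.linalg.solve(lhs, rhs)
cov = np.linalg.inv(lhs)                      # covariance of the fitted parameters
print("fitted parameters:", np.round(a_hat, 3))
print("uncertainties:    ", np.round(np.sqrt(np.diag(cov)), 3))
```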

Slide 17: Nuisance Parameters / Limits / Discovery (session)
• Cousins: limits and nuisance parameters
• Reid: respondent
• Punzi: frequentist multi-dimensional ordering rule
• Tegenfeldt: Feldman-Cousins + Cousins-Highland
• Rolke: limits
• Heinrich: Bayes + limits
• Bityukov: Poisson situations
• Hill: limits vs discovery (see Punzi @ PHYSTAT 2003)
• Cranmer: LHC discovery and nuisance parameters

Slide 18: Systematics
Note: systematic errors (HEP) <-> nuisance parameters (statisticians).
An example (annotated formula on the slide): the observed quantity is related to the physics parameter through auxiliary quantities that we need to know, probably from other measurements and/or theory; their uncertainties enter as the systematic error, and some of them are arguably statistical errors.

Slide 19: Nuisance Parameters
• Nuisance parameters are parameters with unknown true values. They may be:
  – statistical, such as the number of background events in a sideband used for estimating the background under a peak;
  – systematic, such as the shape of the background under the peak, or the error caused by the uncertainty in the hadronic fragmentation model in the Monte Carlo.
• Most experiments have a large number of systematic uncertainties.
• If the experimenter is blind to these uncertainties, they become an even bigger nuisance!

Slide 20: Issues with the LHC
• The LHC will collide protons 40 million times per second and collect petabytes of data. pp collisions at 14 TeV will generate events much more complicated than at LEP or the Tevatron.
• Kyle Cranmer has pointed out that systematic issues will be even more important at the LHC:
  – If the statistical error is O(1) and the systematic error is O(0.1), it does not much matter how you treat it.
  – However, at the LHC we may have processes with ~100 background events and 10% systematic errors; that is not negligible.
  – Even more critically, we want 5σ for a discovery.

Slide 21: Why 5σ? (Feldman + Cranmer)
• LHC searches: ~500 searches, each with ~100 resolution elements (mass bins, angle bins, ...) = 5 × 10^4 chances to find something.
• One experiment: false-positive rate at 5σ is (5 × 10^4)(3 × 10^-7) ≈ 0.015. OK.
• Two experiments:
  – Assume an allowable false-positive rate of 10: 2 × (5 × 10^4)(1 × 10^-4) = 10, so 3.7σ is required.
  – Then require verification by the other experiment, with an allowed combined rate of 0.01: (1 × 10^-3)(10) = 0.01, so 3.1σ is required for confirmation.
• Caveats: is the significance real? Are there common systematic errors?
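The arithmetic on this slide can be checked directly from Gaussian tail probabilities; a quick sketch with scipy:

```python
from scipy.stats import norm

trials = 500 * 100                      # ~5e4 chances to find something per experiment

# One experiment at 5 sigma (one-sided tail probability ~3e-7):
p5 = norm.sf(5.0)
print(f"expected false positives at 5 sigma: {trials * p5:.3f}")   # ~0.015

# Two experiments, allowing ~10 false positives in total:
p_needed = 10 / (2 * trials)            # = 1e-4 per chance
print(f"required significance: {norm.isf(p_needed):.1f} sigma")    # ~3.7 sigma

# Confirmation by the other experiment, allowing a combined rate of 0.01:
p_confirm = 0.01 / 10                   # = 1e-3 per candidate
print(f"required confirmation significance: {norm.isf(p_confirm):.1f} sigma")  # ~3.1 sigma
```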

Slide 22: Confidence Intervals
• Various techniques were discussed during the conference; most concerns were summarized by Feldman:
  – Bayesian: a good method, but Heinrich showed that flat priors in many dimensions may lead to undesirable results (undercoverage).
  – Frequentist-Bayesian hybrids: Bayesian priors for the nuisance parameters, frequentist extraction of the range. Cranmer considered this for the LHC (it was also used in Higgs searches).
  – Profile likelihood: shown by Punzi to have issues when the distribution is Poisson-like.
  – Full Neyman construction: Cranmer and Punzi attempted this, but it is not feasible for a large number of nuisance parameters.
• The Banff workshop this summer was found useful for comparing the various methods. The real recommendations for the LHC will likely come from the 2007 workshop on LHC issues.
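As an illustration of one of the methods above (not code from the conference), a minimal profile-likelihood interval for a Poisson counting experiment whose background nuisance parameter is constrained by a sideband; all numbers are invented:

```python
import numpy as np
from scipy.stats import poisson, chi2
from scipy.optimize import minimize_scalar

# Toy counting experiment: n ~ Poisson(s + b); sideband m ~ Poisson(tau * b)
# constrains the background nuisance parameter b.
n_obs, m_obs, tau = 25, 30, 3.0

def nll(s, b):
    return -(poisson.logpmf(n_obs, s + b) + poisson.logpmf(m_obs, tau * b))

def profiled_nll(s):
    # Profile out the nuisance parameter: minimize over b at fixed signal s.
    res = minimize_scalar(lambda b: nll(s, b), bounds=(1e-6, 100.0), method="bounded")
    return res.fun

s_grid = np.linspace(0.0, 40.0, 401)
q = 2.0 * np.array([profiled_nll(s) for s in s_grid])
q -= q.min()                                   # profile likelihood-ratio test statistic

inside = s_grid[q <= chi2.ppf(0.68, df=1)]     # asymptotic 68% interval
print(f"68% profile-likelihood interval for s: [{inside.min():.1f}, {inside.max():.1f}]")
```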

Slide 23: Event Classification
• The problem: given a measurement of an event X, find F(X) which returns 1 if the event is signal (s) and 0 if it is background (b), optimizing a figure of merit, say s/√b for discovery and s/√(s+b) for an established signal.
• Theoretical solution: use Monte Carlo to calculate the likelihood ratio Ls(X)/Lb(X) and derive F(X) from it. Unfortunately this does not work: in a high-dimensional space, even the largest data set is sparse. (Feldman)
• In recent years, physicists have turned to machine learning: give the computer samples of s and b events and let the computer figure out what F(X) is.
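A toy sketch (invented numbers, not from the talk) of optimizing a cut on a discriminant using the two figures of merit quoted above:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy discriminant distributions: signal peaks higher than background.
sig_score = rng.normal(1.0, 0.5, size=10_000)
bkg_score = rng.normal(0.0, 0.5, size=100_000)
n_sig_exp, n_bkg_exp = 50.0, 1000.0          # expected yields before any cut

best = None
for c in np.linspace(-1.0, 3.0, 200):
    s = n_sig_exp * np.mean(sig_score > c)   # expected signal passing the cut
    b = n_bkg_exp * np.mean(bkg_score > c)   # expected background passing the cut
    if b <= 0:
        continue
    fom_disc = s / np.sqrt(b)                # discovery-oriented figure of merit
    fom_meas = s / np.sqrt(s + b)            # established-signal figure of merit
    if best is None or fom_disc > best[1]:
        best = (c, fom_disc, fom_meas)

print(f"best cut {best[0]:.2f}: s/sqrt(b) = {best[1]:.2f}, s/sqrt(s+b) = {best[2]:.2f}")
```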

Slide 24: Multivariate Analysis (session)
• Friedman: machine learning
• Prosper: respondent
• Narsky: bagging
• Roe: boosting (MiniBooNE)
• Gray: Bayes-optimal classification
• Bhat: Bayesian networks
• Sarda: signal enhancement

Slide 25: Multivariates and Machine Learning
Various methods exist to classify events and to train and test classifiers:
• Artificial neural networks (ANN): currently the most widely used (examples from Prosper, ...).
• Decision trees: a discriminating variable is used to split the sample into branches until leaves with a preset number of signal and background events are found.
• Trees with rules: combining a series of trees to increase the power of a single decision tree (Friedman).
• Bagging (Bootstrap AGGregatING) trees: build a collection of trees by selecting samples of the training data (Narsky).
• Boosted trees: a robust method that gives events misclassified by one tree a higher weight in the generation of the next tree.
Comparisons of significance were performed, but not all of them were controlled experiments, so the conclusions may be deceptive until further tests are done.
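For orientation only, a minimal scikit-learn comparison of a single decision tree with a bagged ensemble on a synthetic two-class sample; nothing here is taken from the conference contributions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic "signal vs background" sample with a few informative features.
X, y = make_classification(n_samples=20_000, n_features=8, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Single decision tree: leaves stop splitting once they hold few events.
tree = DecisionTreeClassifier(min_samples_leaf=50, random_state=0).fit(X_train, y_train)

# Bagging: many trees, each trained on a bootstrap sample of the training data
# (the default base learner of BaggingClassifier is a decision tree).
bag = BaggingClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

print("single tree accuracy:", round(tree.score(X_test, y_test), 3))
print("bagged trees accuracy:", round(bag.score(X_test, y_test), 3))
```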

Slide 26: Example: Boosted Decision Trees (Roe)
• A nice example from MiniBooNE.
• Create M trees and take the final score for signal or background as a weighted sum over the individual trees.
(Figures on the slide: a decision tree, and the boosting of the tree.)
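A minimal sketch of the weighted-sum-of-trees idea using AdaBoost in scikit-learn; this is illustrative and not the MiniBooNE code:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20_000, n_features=8, n_informative=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# AdaBoost: events misclassified by one tree get a larger weight when the
# next tree is grown; the final score is a weighted vote over the M trees.
bdt = AdaBoostClassifier(n_estimators=200, random_state=1).fit(X_train, y_train)

score = bdt.decision_function(X_test)    # weighted sum over the individual trees
print("score range:", round(score.min(), 2), "to", round(score.max(), 2))
print("test accuracy:", round(bdt.score(X_test, y_test), 3))
```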

Slide 27: Punzi Effect (getting L wrong)
Giovanni Punzi @ PHYSTAT 2003: "Comments on L fits with variable resolution."
• The problem: separating two close signals (A and B) when the resolution σ varies event by event and differs between the two signals, e.g. in a mass M, different numbers of tracks give a different σM.
• Avoiding the Punzi bias:
  – include p(σ|A) and p(σ|B) in the fit, OR
  – fit each range of σi separately and add up the (NA)i to obtain (NA)total, and similarly for B.
• Beware of event-by-event variables and construct the likelihoods accordingly (talk by Catastini).
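Schematically, the per-event likelihood that avoids the bias includes the resolution densities conditional on the species; the notation here (f for the A fraction, p for the densities) is illustrative rather than taken from the talk:

```latex
% Omitting p(\sigma_i | A) and p(\sigma_i | B) (i.e. assuming they are equal) is
% what biases the fitted fractions when A and B have different resolution distributions.
\mathcal{L} \;=\; \prod_{i=1}^{N}
  \Bigl[\, f \, p(M_i \mid A, \sigma_i)\, p(\sigma_i \mid A)
        \;+\; (1-f)\, p(M_i \mid B, \sigma_i)\, p(\sigma_i \mid B) \Bigr]
```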

Slide 28: Blind Analyses
Potential problem: experimenters' bias. Original suggestion: Luis Alvarez(?).
Methods of blinding:
• Keep the signal-region box closed
• Add random numbers to the data
• Keep Monte Carlo parameters blind
• Use part of the data to define the procedure
A number of analyses in experiments are done as blind searches. In general, don't modify the result after unblinding.
Question: will the LHC experiments choose to be blind? In which analyses?

Slide 29: Astrophysics + Cosmology Highlights

Slide 30: Astro/Cosmo General Issues
'"There is only one universe" and some experiments can never be rerun' – A. Jaffe (concluding talk)
• Astro + cosmo tend to be more Bayesian, by nature. (Diagram on the slide: a spectrum from Bayesians to Frequentists, with cosmologists and "astronomers" towards the Bayesian end and particle physicists towards the frequentist end.)
• Virtual Observatories: all astro data available from the desktop.
• Data volume growth is doubling every year, and most data are on the web (Szalay):
  – Bad: computing and storage issues.
  – Good(?): systematic errors become more significant than statistical errors.
• Nichol discussed using grid techniques.

Slide 31: Astrophysics: Various Hot Points
• Flat priors have been used commonly but are dangerous (Cox, Le Diberder, Cousins): which would be the best quantity to use, Ω or Ωh²?
• Issues with the non-Gaussian distribution of noise being taken into account in the spectrum: a few methods discussed by Starck, Digel, ...
• Blind analyses are rare (not so good for a-priori modelling!).
• Lots of good software in astrophysics, and repositories more advanced than in particle physics.
• Jaffe's talk gave a nice usage of the CMB as a case study for statistical methods in astrophysics, starting from first principles of Bayesian analysis.

Slide 32: Software and Available Tools

Slide 33: Talks Given on Software
• Linnemann: software for particle physics
• Nichol: software for astrophysics (and grid)
• Le Diberder: sPlot
• Paterno: R
• Kreschuk: ROOT
• Verkerke: RooFit
• Pia: goodness-of-fit toolkit
• Buckley: CEDAR
• Narsky: StatPatternRecognition

Slide 34: Available Tools
• A good deal of quality software has become more and more available (good news for the LHC!).
• Particle physics and astro use somewhat different software (e.g. IDL and IRAF on the astro side).
• A 2004 Phystat workshop at MSU on statistical software (mainly on R and ROOT), organised by Linnemann.
• Statisticians have a repository of standard source code (StatLib): http://lib.stat.cmu.edu/
• One good output of the conference was the recommendation of a statistical software repository at FNAL.
• Linnemann has a web page of collections: http://www.pa.msu.edu/people/linnemann/stat_resources.html

Slide 35: CDF Statistics Committee Resources
• Documentation about statistics and a repository: http://www-cdf.fnal.gov/physics/statistics_home.html

Slide 36: Sample Repository Page (screenshot only)

Slide 37: CEDAR & CEPA (screenshots only)

Slide 38: Summary & Conclusions
• Very useful physicist/statistician interaction, e.g. on confidence intervals with nuisance parameters, multivariate techniques, etc.
• Lots of things were learnt from:
  – ourselves (by having to present our own stuff!)
  – each other (various different approaches)
  – the statisticians (updates on techniques)
• A step towards common tools / software repositories: http://www.phystat.org (Linnemann)
• Programme, transparencies, papers, etc.: http://www.physics.ox.ac.uk/phystat05 (with useful links such as recommended reading)
• Proceedings published by Imperial College Press (Spring '06)

Slide 39: What is Next?
• A few workshops/schools have taken place since October 2005, e.g. Manchester (Nov 2005), SAMSI Duke (April 2006), Banff (July 2006), Spanish Summer School (July 2006).
• No PHYSTAT conference in summer 2007.
• ATLAS Workshop on Statistical Methods, 18-19 Jan 2007.
• PHYSTAT Workshop at CERN, 27-29 June 2007, on "Statistical issues for LHC Physics analyses". (Both workshops will likely aim at discovery significance. Please attend!)
• Suggestions/enquiries to: l.lyons@physics.ox.ac.uk
• The LHC will take data soon. We do not wish to say "The experiment was inconclusive, so we had to use statistics" (inside cover of "the Good Book" by L. Lyons), but rather "We used statistics, and so we are sure that we have discovered X" (well... with some confidence level!).

Slide 40: Some Final Notes
• I have tried to give you a collage of PHYSTAT 05 topics.
• My deepest thanks to Louis for giving me the chance and introducing me to the PHYSTAT experience!
• Apologies for the talks I have not been able to cover.
• Thank you for the invitation!

Slide 41: Backup
• Bayes
• Frequentist
• Cousins-Highland
• Higgs saga at CERN

Slide 42: Bayesian Approach
Bayes' theorem: posterior ∝ likelihood × prior.
Problems:
• P(param): true or false? It is a "degree of belief".
• The prior: what functional form? Flat? In which variable?
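Written out in standard notation (not copied from the slide):

```latex
% Bayes' theorem for a parameter theta and data x
P(\theta \mid x) \;=\;
  \frac{P(x \mid \theta)\, P(\theta)}{\int P(x \mid \theta')\, P(\theta')\, \mathrm{d}\theta'}
\;\propto\;
  \underbrace{P(x \mid \theta)}_{\text{likelihood}} \times \underbrace{P(\theta)}_{\text{prior}}
```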

Slide 43: Frequentist Approach: Neyman Construction
(Figure: confidence belt in the (x, µ) plane, read off at the observed value x0.)
• µ = theoretical parameter, x = observation.
• NO PRIOR is needed.

Slide 44: Frequentist Approach at 90% Confidence
• Frequentist: the interval endpoints are known but random; the parameter is unknown but fixed. The probability statement is about the interval endpoints.
• Bayesian: the interval endpoints are known and fixed; the parameter is unknown and random. The probability (credible) statement is about the parameter.

Slide 45: A Method: Mixed Frequentist-Bayesian
• The full frequentist method is hard to apply in several dimensions.
• Use Bayesian priors for the nuisance parameters and a frequentist construction to extract the range.
• Philosophical/aesthetic problems?
• Cousins and Highland, NIM A 320 (1992) 331.
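A toy sketch of the hybrid idea (not the exact Cousins-Highland prescription): average the Poisson probability over a Gaussian prior for the background, then scan the signal strength for a 95% CL upper limit. All numbers are invented:

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(3)

# Observed counts and a background estimate with an uncertainty treated as a
# nuisance parameter with a Gaussian prior.
n_obs = 5
b_mean, b_sigma = 3.0, 0.6

# "Bayesian" part: Monte Carlo average over the prior for b (truncated at 0).
b_samples = np.clip(rng.normal(b_mean, b_sigma, size=20_000), 0.0, None)

def cl_sb(s):
    # Prior-averaged probability of observing <= n_obs events for signal s.
    return np.mean(poisson.cdf(n_obs, s + b_samples))

# "Frequentist" part: scan s and exclude values with P(n <= n_obs) < 5%.
s_grid = np.linspace(0.0, 20.0, 2001)
upper = next(s for s in s_grid if cl_sb(s) < 0.05)
print(f"approximate 95% CL upper limit on the signal: {upper:.2f}")
```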

Slide 46: Higgs Saga
P(Data; Theory) is not the same as P(Theory; Data).
Is the data consistent with the Standard Model, or with the Standard Model + Higgs?
End of Sept 2000: the data were not very consistent with the S.M.:
  Prob(Data; S.M.) < 1% – a valid frequentist statement.
Turned by the press into:
  Prob(S.M.; Data) < 1%, and therefore Prob(Higgs; Data) > 99%,
i.e. "it is almost certain that the Higgs has been seen".