Applied Statistics for the Office of Science

Applied Statistics for the Office of Science
Understanding Variability and Bringing Rigor to Scientific Investigation
George Ostrouchov
Statistics and Data Sciences Group
Computer Science and Mathematics Division
Oak Ridge National Laboratory

Filling a Gap in Statistics to Address Office of Science Needs
ASCR Strategic Plan
“[AMR] weaknesses include an underinvestment or lack of investment in several critical areas:
· ...
· Underinvestment in statistics”
U.S. Department of Energy Office of Science
“The following gaps in the [AMR] program have been identified:
· Multiscale mathematics
· Ultrascale algorithms
· Discrete mathematics
· Statistics – investments in this area are required to deal with extracting knowledge from the oceans of data that large-scale simulations will produce.
· Multiphysics”
Office of Science Response to the Data Challenge: The Office of Science will initiate a long-term research program to address the “Curse of Dimensionality.” Raymond L. Orbach, AAAS, Feb. 19, 2006
Through Applied Statistics, ASCR has the opportunity to engage the dominant segment of Applied Mathematics for its goals.
The ORNL Applied Statistics program can address the curse of dimensionality and other Office of Science goals.
OAK RIDGE NATIONAL LABORATORY, U.S. DEPARTMENT OF ENERGY

Statistics Brings Rigor and Efficiency to Scientific Investigation and Technology
· Karl Pearson (1857-1936)
  - “The Grammar of Science” (1892) – Relativity
  - First Department of Statistics (1911), UCL
  - Founding editor of Biometrika
[Photo] Conrad Habicht, Maurice Solovine, and Albert Einstein, the self-styled Olympia Academy, in about 1903. At Einstein’s suggestion, the first book read was Pearson’s “The Grammar of Science.” Credit: Image Archive ETH-Bibliothek, Zürich

Common Evolutionary Steps: Experimental Science and Computational Science
· Early computational science relies largely on intuitive design and visual validation
  - Computational experiments are expensive
  - Petascale data sets are nearly as opaque as real systems – statistical analysis must select what to visualize
  - Uncertainty analysis is in its infancy
· Statistics is a major partner in bringing computational science to the rigor and efficiency standards of experimental science
  - Methods to see through, examine, and classify variability
  - Uncertainty quantification
  - Statistical design of experiments
  - Fusion of data and computational experiment

Statistics: the Study of Variability
· The discipline concerned with the study of variability, with the study of uncertainty, and with the study of decision-making in the face of uncertainty.
· Large-scale user of mathematical and computational tools with a focused scientific agenda
· Inherently interdisciplinary
Source: [NSF 2004] Jon Kettenring, Bruce Lindsay, and David Siegmund, editors, 2004. Statistics: Challenges and Opportunities for the Twenty-First Century.
Cuts through the fog of variability and brings efficiency to science.

Mathematics, Computer Science, and Statistics Are Biology’s, Chemistry’s, Materials’, Particle Physics’, and Astrophysics’ Next Microscope, Device, or Telescope, Only Better
Cohen JE (2004). PLoS Biol 2(12): e439. Fellow AAAS, Fellow Am. Phil. Soc., Member NAS
[Slide labels: Computer Science and Mathematics; Multiscale Math; Statistics; Computer Science]
Here are five mathematical challenges that would contribute to the progress of biology.
(1) Understand computation. Find more effective ways to gain insight and prove theorems from numerical or symbolic computations and agent-based models. We recall Hamming: “The purpose of computing is insight, not numbers” (Hamming 1971, p. 31).
(2) Find better ways to model multi-level systems, for example, cells within organs within people in human communities in physical, chemical, and biotic ecologies.
(3) Understand probability, risk, and uncertainty. Despite three centuries of great progress, we are still at the very beginning of a true understanding. Can we understand uncertainty and risk better by integrating frequentist, Bayesian, subjective, fuzzy, and other theories of probability, or is an entirely new approach required?
(4) Understand data mining, simultaneous inference, and statistical de-identification (Miller 1981). Are practical users of simultaneous statistical inference doomed to numerical simulations in each case, or can general theory be improved? What are the complementary limits of data mining and statistical de-identification in large linked databases with personal information?
(5) Set standards for clarity, performance, publication and permanence of software and computational results.

Particle Physics Embraces Statistics
“… since 1900 … statistics … takes over field after field … [as] … the methodology of choice … people in astronomy and physics … are starting to use statistics a lot more for the simple reason that they have to be efficient now. … I don't see any area where it's being resisted much.”
Bradley Efron
Chair, Department of Statistics, Stanford University, and Max H. Stein Professor of Humanities and Sciences
2005 National Medal of Science Recipient

Citations to Statistics Comprise the Dominant Group within Mathematics
Highly Cited Journals in Mathematics
Rank  Journal                        Citations, 1991-2001
1.    J. American Statistical Assn.  16,457
2.    Biometrics                     10,854
3.    J. Math. Analysis              9,845
4.    Annals of Statistics           9,702
5.    Proc. Amer. Math Soc.          9,237
6.    C. R. Acad. Sci. Ser. I Math.  9,153
7.    Trans. Amer. Math. Soc.        8,586
8.    Journal of Algebra             8,531
9.    J. Functional Analysis         7,999
10.   Biometrika                     7,911
11.   SIAM J. Numer. Anal.           7,383
12.   Inventiones Mathematicae       7,382
13.   J. Royal Stat. Soc. B          6,575
14.   Mathemat. Programming          6,444
15.   Linear Algebra Appl.           6,112
Highly Cited Authors in Mathematics for period 1991-2001
Rank Name                   Affiliation               Department / Field  Papers  Citations
1.   Pierre-Louis Lions     University of Paris 9     Mathematics         75      1207
2.   David L. Donoho        Stanford University       Statistics          27      1182
3.   Adrian F. M. Smith     Univ. London              Statistics          40      1026
4.   Elizabeth A. Thompson  U. Washington             Biostatistics       11      973
5.   Iain M. Johnstone      Stanford University       Statistics          17      968
6.   Jianqing Fan           Chinese U. Hong Kong      Statistics          53      901
7.   Donald B. Rubin        Harvard University        Statistics          38      854
8.   Ingrid Daubechies      Princeton University      Mathematics         20      807
9.   Adrian E. Raftery      U. Washington             Statistics/Sociol.  31      804
10.  Alan E. Gelfand        U. Connecticut            Statistics          35      747
11.  Sun-Wei Guo            Med. Coll. Wisconsin      Biostatistics       6       737
12.  Scott L. Zeger         Johns Hopkins Univ.       Biostatistics       23      723
13.  Peter J. Green         University of Bristol     Statistics          14      667
14.  Bradley P. Carlin      University of Minnesota   Biostatistics       28      663
15.  J. Stephen Marron      U. North Carolina         Statistics          43      618
16.  David G. Clayton       MRC, Cambridge            Biostatistics       4       598
17.  Gareth O. Roberts      Lancaster Univ.           Statistics          41      598
18.  Albert Cohen           University of Paris       Mathematics         61      572
19.  Michael Rockner        Univ. Bielefeld, Germany  Mathematics         69      572
20.  Yangbo Ye              University of Iowa        Mathematics         42      567
21.  Jinchao Xu             Pennsylvania St. U.       Mathematics         22      566
22.  Xiao-Li Meng           University of Chicago     Statistics          27      561
23.  Matthew P. Wand        Harvard University        Biostatistics       31      558
24.  Wally R. Gilks         MRC                       Biostatistics       16      551
25.  M. Chris Jones         Open University           Statistics          52      542
Citations per paper: Statistics and Biostatistics – 27; Rest of Mathematics – 15
19 of the top 25 most cited mathematics authors are from Statistics or Biostatistics!
Statistics is Highly Interdisciplinary!
SOURCE: ISI Essential Science Indicators, Sci. Citation Index (300 journals in pure mathematics, applied mathematics, statistics and probability)

Statistics Disseminates Data Analysis Ideas Across Science Domains
· Of 500 recent citations of Efron’s “Bootstrap” paper, 348 were outside statistics. [NSF 2004]
· Mitchell’s “Detmax Algorithm” paper (funded by AMR at ORNL): 200+ citations; those shown in red in the figure are outside statistics.

Statistics Core Research Disseminates and Unifies Data Analysis Ideas
Tames the explosion of data analytic methods by
· Providing portability between science domains
· Deriving properties of new data analytic methods
· Building bridges between data analytic methods
Examples:
· Latent Semantic Indexing (Dumais+ 1991) and Correspondence Analysis (Benzecri 1969, 1980, 1992, Greenacre 1984)
· Empirical Orthogonal Functions (Lorenz 1956) and a climate time series application of Principal Components Analysis (Pearson 1902, Hotelling 1935)
· Support Vector Machines (Vapnik 1995) and Logistic Regression (Cox 1970) via hinge loss function (Hastie+ 2001)
· FastMap approximation to Principal Components (Faloutsos+ 1995): bridge to Convex Hull and new methods, RobustMap (Ostrouchov+ 2005), and to right Householder transformations (Ostrouchov+ 2006)
Addressing the Curse of Dimensionality
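
The bridge between Support Vector Machines and Logistic Regression listed above can be illustrated by writing both methods as loss functions of the same quantity, the margin m = y * f(x) with labels y in {-1, +1}. The following is a minimal sketch in Python, not code from the presentation; the function names are hypothetical, and the loss formulas are the standard textbook forms (hinge loss and binomial deviance), assumed here to be the comparison that the Hastie+ 2001 reference has in mind.

```python
import math


def hinge_loss(margin):
    """SVM hinge loss as a function of the margin m = y * f(x)."""
    return max(0.0, 1.0 - margin)


def logistic_loss(margin):
    """Logistic regression loss (binomial deviance), log(1 + exp(-m))."""
    # Split on the sign of the margin so exp() never overflows.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Both losses penalize negative margins roughly linearly and
    # become small for confidently correct predictions.
    for m in (-2.0, -1.0, 0.0, 1.0, 2.0):
        print(f"margin={m:+.1f}  hinge={hinge_loss(m):.3f}  "
              f"logistic={logistic_loss(m):.3f}")
```

Seen this way, the two methods differ mainly in how they treat points that are already well classified: the hinge loss is exactly zero once the margin reaches 1, while the logistic loss decays smoothly and never quite reaches zero.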

Quantitative Rigor for Science: Transfer From Medicine via Core Statistics to Big Bang
Science publication on Big Bang while others still plow through plethora of data
“I … emphasize the symbiotic relationship … between the Statisticians and Astrophysicists …. It is now … clear that there are common problems …” Bob Nichol (CMU Physics)
Statistics core is the hub that disseminates and unifies data analysis ideas. Critical mass engagement is needed to reap short-term and long-term returns.
Science Applications
· Miller, CJ; Genovese, C; Nichol, RC; et al. Controlling the false-discovery rate in astrophysical data analysis. Astronomical Journal, 122 (6): 3492-3505, Dec 2001
· Miller, CJ; Nichol, RC; Batuski, DJ. Acoustic oscillations in the early universe and today. Science, 292 (5525): 2302-2303, Jun 22 2001
Statistics Core
Family-wise error rate of statistical tests:
· One test: 0.05 probability of a false positive
· Fifty tests: 0.93 probability of a false positive; need simultaneous inference (SI)
· Thousand tests: SI too conservative; need FDR (False Discovery Rate)
“Interdisciplinary” “Decision-making in the face of uncertainty”
Source: [NSF 2004] Jon Kettenring, Bruce Lindsay, and David Siegmund, editors, 2004. Statistics: Challenges and Opportunities for the Twenty-First Century.
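
The family-wise error rate figures above can be reproduced with a short calculation, sketched here in Python under stated assumptions. The sketch assumes independent tests, each run at level 0.05, so the chance of at least one false positive among n tests is 1 - 0.95^n; for fifty tests this gives roughly 0.92, in line with the slide's figure. The slide does not name specific procedures, so the Bonferroni correction (standing in for simultaneous inference) and the Benjamini-Hochberg step-up procedure (a standard way to control FDR) are assumptions, as are the function names and the example p-values.

```python
def family_wise_error_rate(num_tests, alpha=0.05):
    """Probability of at least one false positive among independent tests."""
    return 1.0 - (1.0 - alpha) ** num_tests


def bonferroni_reject(p_values, alpha=0.05):
    """Simultaneous inference via Bonferroni: test each p-value at alpha / m."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def benjamini_hochberg_reject(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure controlling the FDR at level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose sorted p-value is at most k * q / m.
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            largest_k = rank
    # Reject every hypothesis ranked at or below that k.
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= largest_k:
            reject[idx] = True
    return reject


if __name__ == "__main__":
    for n in (1, 50, 1000):
        print(f"{n} tests: FWER = {family_wise_error_rate(n):.3f}")
    # Illustrative p-values only, not data from the cited papers.
    example = [0.001, 0.008, 0.039, 0.041, 0.042,
               0.060, 0.074, 0.205, 0.212, 0.216]
    print("Bonferroni rejections:", sum(bonferroni_reject(example)))
    print("Benjamini-Hochberg rejections:", sum(benjamini_hochberg_reject(example)))
```

On the illustrative p-values, the Bonferroni threshold admits fewer discoveries than the Benjamini-Hochberg procedure, which is the sense in which simultaneous inference becomes too conservative as the number of tests grows.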

Engage Core Statistics for OASCR Goals
· A gap exists between statistics research and simulation science
· Engage statistics with leadership computing
· Engage statistics with simulation science data
· Engage statistics with Office of Science experimental data (neutron science)
[Diagram labels: Statistics Core; Science Applications; Neutron Science; Astrophysics Simulation; Combustion Simulation; Superscalable Algorithms; Tuning Leadership Facilities; Genome Science; Ontologies for Energy; Computational Chemistry; Climate Simulation; Fusion Simulation]