R: An Open Source Statistical Environment
Valentin Todorov, UNIDO
v.todorov@unido.org
MSIS 2008 (Luxembourg, 7-9 April 2008), 8.4.2008

Outline
• Introduction: the R Platform and Availability
• R Learning Curve (is R hard to learn?)
• R Extensibility (R Packages)
• R and the Others (Interfaces)
• R Graphics
• R for Time Series
• R for Survey Analysis
• R and the Outliers (Robust Statistics in R)
• More R Features (Web, Missing Data, OOP, GUI)
• Summary and Conclusions

What is R
• R is "a system for statistical computation and graphics. It provides, among other things, a programming language, high-level graphics, interfaces to other languages and debugging facilities"
• Developed after the S language and environment
  – S was developed at Bell Labs (John Chambers et al.)
  – S-Plus: a value-added implementation of the S language by Insightful Corporation
  – much code written for S runs unaltered under R
• Significantly influenced by Scheme, a Lisp dialect

What is R
• Ihaka and Gentleman, University of Auckland (New Zealand)
  – 1993: a preliminary version of R
  – 1995: released under the GNU General Public License
  – Now: R-core team consisting of 17 members, including John Chambers
• R provides a wide variety of statistical techniques (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, robust methods and many more) and graphical techniques
• R is available as Free Software under the terms of the GNU General Public License (GPL)

R Extensibility (R Packages)
• One of the most important features of R is its extensibility through packages of functions and data.
• The R package system provides a framework for developing, documenting, and testing extension code.
• Packages can include R code, documentation, data and foreign code written in C or Fortran.
• Packages are distributed through the CRAN repository, http://cran.r-project.org, which currently holds more than 1300 packages covering a wide variety of statistical methods and algorithms. 'base' and 'recommended' packages are included in all binary distributions.

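A minimal sketch of how the package system looks from the R prompt. It lists the installed packages with priority 'base' or 'recommended' and attaches one of them. The choice of splines and the package named in the commented-out install call are illustrative assumptions.

```r
# Packages installed with priority "base" or "recommended"
core <- installed.packages(priority = c("base", "recommended"))
print(core[, c("Package", "Priority")])

# Attach an installed package (splines ships with R; illustrative choice)
library(splines)

# Installing a contributed package from CRAN needs network access,
# so the call is left commented out in this sketch
# install.packages("rrcov")
```

Once attached, the functions and data sets of a package are used exactly like built-in ones.
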
R and the Others (R Interfaces)
• Reading and writing data (text files, XML, spreadsheet-like data, e.g. Excel)
• Read and write data formats of SAS, S-Plus, SPSS, STATA, Systat, Octave: package foreign
• Emulation of Matlab: package matlab
• Communication with RDBMS (ROracle, RMySQL, RSQLite, RmSQL, RPgSQL, RODBC) for large data sets and concurrency
• Package filehash: a simple key-value style database; the data are stored on disk but are handled like data sets
• Can use compiled native code in C, C++, Fortran, Java

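The simplest of these interfaces, plain text files, can be sketched with base R alone. The data frame and its values are hypothetical, and the file is written to a temporary location.

```r
# Small illustrative data frame (hypothetical values)
df <- data.frame(country = c("AT", "LU", "NZ"), value = c(1.5, 2.25, 3))

# Write it to a temporary CSV file and read it back
path <- tempfile(fileext = ".csv")
write.csv(df, path, row.names = FALSE)
back <- read.csv(path, stringsAsFactors = FALSE)

# Inspect the structure of the re-imported data, then clean up
str(back)
unlink(path)
```

The other formats listed above need the corresponding package, such as foreign or one of the database interfaces.
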
R Graphics
• One of the most important strengths of R: simple exploratory graphics as well as well-designed publication-quality plots.
• The graphics can include mathematical symbols and formulae where needed.
• Can produce graphics in many formats:
  – On screen
  – PS and PDF for including in LaTeX and pdfLaTeX or for distribution
  – PNG or JPEG for the Web
  – On Windows, metafiles for Word, PowerPoint, etc.

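As a small sketch of these points, the following writes a plot with a mathematical formula in its title to a PDF file. The plotted curve and the temporary file name are illustrative choices.

```r
# Open a PDF graphics device on a temporary file
out <- tempfile(fileext = ".pdf")
pdf(out, width = 6, height = 4)

# Standard normal density, annotated with its formula in the title
curve(dnorm(x), from = -3, to = 3, ylab = "density")
title(main = expression(f(x) == frac(1, sqrt(2 * pi)) * e^{-x^2 / 2}))

# Close the device so that the file is written
dev.off()
```

Replacing pdf() with png() or postscript() sends the same plot to another of the formats listed above.
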
R Graphics: basic and multipanel plots (trellis)

R Graphics: parallel plot and coplot

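The figures from these two slides are not reproduced here. As a stand-in, a conditioning plot can be sketched with base graphics. The quakes data set ships with R and is an assumption of this sketch, not necessarily the data shown on the slide.

```r
# Draw into a temporary PDF file so the sketch also runs without a display
out <- tempfile(fileext = ".pdf")
pdf(out)

# Conditioning plot: latitude against longitude, one panel per depth interval
coplot(lat ~ long | depth, data = quakes)

dev.off()
```

Trellis-style multipanel plots follow the same conditioning idea and are provided by the lattice package.
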
R for Time Series
• Package stats: classical time series modeling tools
  – arima() for Box-Jenkins type analysis
  – structural time series: StructTS()
  – filtering and decomposition: decompose() and HoltWinters()
• Package forecast: additional forecast methods and graphical tools
• Analyzing monthly or lower frequency time series:
  – TRAMO/SEATS
  – X-12-ARIMA
  ⇒ accessible through the Gretl library
• Task View Econometrics: http://cran.r-project.org/web/views/Econometrics.html

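A brief sketch of the decomposition and smoothing tools from package stats. The co2 series ships with R and is chosen here only for illustration.

```r
# Classical decomposition of the monthly CO2 series into
# seasonal, trend and random components
parts <- decompose(co2)
print(round(parts$figure, 2))   # estimated seasonal figure, one value per month

# Holt-Winters exponential smoothing with trend and additive seasonality
hw <- HoltWinters(co2)

# Forecast the next 12 months from the fitted smoother
fc <- predict(hw, n.ahead = 12)
print(fc)
```

Both results are ordinary R objects, so plot(parts) and plot(hw, fc) can be used to inspect them graphically.
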
R for Time Series: Example
• Fitting an ARIMA model to a univariate time series with arima() and using tsdiag() to plot time series diagnostics

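The original figure is not reproduced here. A minimal sketch of the same workflow follows, assuming the lh series that ships with R and an AR(1) specification. Neither is necessarily what the slide used.

```r
# Fit an AR(1) model to lh, a short hormone series shipped with R
fit <- arima(lh, order = c(1, 0, 0))
print(fit)

# Diagnostic plots: standardized residuals, residual ACF, Ljung-Box p-values
out <- tempfile(fileext = ".pdf")
pdf(out)
tsdiag(fit)
dev.off()

# Point forecasts and standard errors for the next 12 observations
pred <- predict(fit, n.ahead = 12)
```
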
R for Survey Analysis
• Complex survey samples are usually analysed with specialized software packages: SUDAAN, Bascula 4 (Statistics Netherlands), etc.
• STATA provides much more comprehensive support for analysing survey data than SAS and SPSS and could successfully compete with the specialized packages

R for Survey Analysis
• R: package survey, http://faculty.washington.edu/tlumley/survey/
  – stratification, clustering, possibly multistage sampling, unequal sampling probabilities or weights; multistage stratified random sampling with or without replacement
  – Summary statistics: means, totals, ratios, quantiles, contingency tables, regression models, for the whole sample and for domains
  – Variances by Taylor linearization or by replicate weights (BRR, jackknife, bootstrap, or user-supplied)
  – Graphics: histograms, hexbin scatterplots, smoothers
• Other packages: pps, sampling, sampfling

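To make the idea of design-based estimation concrete, here is a base R sketch of the textbook estimator of a mean under stratified simple random sampling without replacement. It does not use the survey package and is not its implementation. The sample, the stratum sizes and all variable names are hypothetical.

```r
# Hypothetical stratified sample: stratum label, stratum population size N, observed y
smp <- data.frame(
  stratum = c("A", "A", "A", "B", "B", "B", "B"),
  N       = c(100, 100, 100, 300, 300, 300, 300),
  y       = c(10, 12, 14, 20, 22, 24, 26)
)

# Per-stratum quantities
by_h   <- split(smp, smp$stratum)
N_h    <- sapply(by_h, function(d) d$N[1])     # population size
n_h    <- sapply(by_h, nrow)                   # sample size
ybar_h <- sapply(by_h, function(d) mean(d$y))  # sample mean
s2_h   <- sapply(by_h, function(d) var(d$y))   # sample variance
W_h    <- N_h / sum(N_h)                       # stratum weight

# Stratified mean and its variance with finite population correction
ybar_st <- sum(W_h * ybar_h)
var_st  <- sum(W_h^2 * (1 - n_h / N_h) * s2_h / n_h)

# The same point estimate written as a mean weighted by sampling weights N_h / n_h
smp$w <- smp$N / ave(smp$y, smp$stratum, FUN = length)
ybar_w <- weighted.mean(smp$y, smp$w)

print(c(mean = ybar_st, se = sqrt(var_st), weighted = ybar_w))
```

The survey package wraps this kind of calculation, and the more complex designs listed above, behind a design object.
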
R and the Outliers (Robust Statistics in R)
• What are outliers
  – atypical observations which are inconsistent with the rest of the data or deviate from the postulated model
  – may arise through contamination, errors in data gathering, or misspecification of the model
  – classical statistical methods are very sensitive to such data
• What are robust methods
  – produce reasonable results even when one or more outliers appear in the data
  – Robust regression: robustbase
  – Robust multivariate methods: rrcov, robustbase
  – Robust time series analysis: robust-ts

R and the Outliers: Example
• Example: Wages and Hours, http://lib.stat.cmu.edu/DASL/
  – a national sample of 6000 households with a male head earning less than $15,000 annually in 1966; 9 independent variables; classified into 39 demographic groups
  – estimate y = the labour supply (average hours) from the available data (for the example we will consider only one variable, x = average of the respondents)
  – We will fit an Ordinary Least Squares (OLS) model and a robust Least Trimmed Squares (LTS) model

R and the Outliers: Example OLS

R and the Outliers: Example LTS

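The OLS and LTS output from these two slides is not reproduced here. The contrast can be sketched on simulated data. Two assumptions apply: the data are artificial rather than the Wages and Hours set, and LTS is fitted with lqs() from the recommended package MASS instead of the robustbase or rrcov functions that the slides point to.

```r
library(MASS)  # recommended package; lqs() provides LTS regression

# Simulated data around the line y = 2 + 0.5 x (illustrative values)
set.seed(1)
x <- 1:30
y <- 2 + 0.5 * x + rnorm(30, sd = 0.5)

# Contaminate the last five observations with a large downward shift
y[26:30] <- y[26:30] - 15

# Classical and robust fits
fit_ols <- lm(y ~ x)
fit_lts <- lqs(y ~ x, method = "lts")

# Compare the estimated coefficients side by side
print(rbind(OLS = coef(fit_ols), LTS = coef(fit_lts)))
```

The five shifted points pull the OLS slope away from 0.5, while the LTS fit is determined by the majority of the observations.
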
R and the Outliers: Example Covariance
• Maronna & Yohai (1998)
• rrcov: data set maryo
• A bivariate data set with sample correlation 0.81
• Interchange the largest and smallest value in the first coordinate
• The sample correlation becomes 0.05

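The same effect can be sketched without the rrcov package on a small artificial data set. The values below are hypothetical and are not the maryo data, so the correlations differ from the 0.81 and 0.05 quoted above.

```r
# Artificial bivariate data with a strong linear relationship
x <- 1:20
y <- x + rep(c(-2, 2), times = 10)
r_before <- cor(x, y)

# Interchange the largest and smallest value in the first coordinate
i_min <- which.min(x)
i_max <- which.max(x)
x_swapped <- x
x_swapped[c(i_min, i_max)] <- x[c(i_max, i_min)]
r_after <- cor(x_swapped, y)

print(c(before = r_before, after = r_after))
```

Moving only two of the twenty values is enough to reduce the classical sample correlation substantially.
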
More R…
• R and the Web: several projects make it possible to use R over the Web
• R and the Missing: advanced missing value handling
  – mvnmle: ML estimation for multivariate data with missing values
  – mitools: Tools for multiple imputation of missing data
  – mice: Multivariate Imputation by Chained Equations
  – EMV: Estimation of Missing Values for a Data Matrix
  – VIM: provides methods for the visualisation as well as imputation of missing data
• R Objects: R is an object-oriented language (however, in a quite different sense from C++, Java, C#)

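One way in which R's object orientation differs from C++ or Java is that methods belong to generic functions rather than to classes. A minimal S3 sketch follows. The generic area() and the classes circle and rectangle are hypothetical names introduced for illustration.

```r
# A generic function dispatches on the class attribute of its argument
area <- function(shape, ...) UseMethod("area")

# Methods are ordinary functions named generic.class
area.circle    <- function(shape, ...) pi * shape$r^2
area.rectangle <- function(shape, ...) shape$w * shape$h

# Objects are ordinary lists tagged with a class attribute
shapes <- list(
  structure(list(r = 1), class = "circle"),
  structure(list(w = 2, h = 3), class = "rectangle")
)

# The same call selects the appropriate method for each object
areas <- sapply(shapes, area)
print(areas)
```

Built-in generics such as print(), summary() and plot() work in the same way.
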
More R…
• R GUI
  – R Commander: a basic statistics GUI, consisting of a window containing several menus, buttons, and information fields
  – SciViews: a suite of companion applications for Windows
• R and SDMX
• R Reports
  – package xtable: coerce data to LaTeX and HTML tables
  – Sweave: a framework for mixing text and R code for automatic report generation

Summary
• Output Management System
  – SAS/SPSS: rarely used for routine work
  – R: output is easily passed from one function to another for further processing and to obtain more results
• Macro Language
  – SAS/SPSS: a special language with its own syntax; the new functions are not run in the same way as the built-in procedures
  – R: R itself is a programming language
• Matrix Language
  – SAS/SPSS: a special language with its own syntax
  – R: a vector- and matrix-based language, complemented by additional packages: Matrix, SparseM

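The first and third points can be sketched together. A fitted model is an ordinary object that other functions accept, and the same estimates can be computed directly with matrix algebra. The cars data set ships with R and is an illustrative choice.

```r
# Fit a linear model; the result is an ordinary R object
fit <- lm(dist ~ speed, data = cars)

# Pass the fitted object on to other functions for further results
coefs    <- coef(summary(fit))       # estimates, standard errors, t and p values
slope_ci <- confint(fit)["speed", ]  # confidence interval for the slope

# The same least-squares estimates via the normal equations
X    <- cbind(1, cars$speed)
beta <- solve(t(X) %*% X, t(X) %*% cars$dist)

print(coefs)
print(slope_ci)
print(beta)
```
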
Summary (cont.)
• Publishing results
  – SAS/SPSS: cut and paste to a word processor or export to a file
  – R: produce LaTeX output (including graphics) using, for example, Sweave
• Data size
  – SAS/SPSS: limited by the size of the disk
  – R: limited by the size of the RAM; (not trivial) usage of databases for large data sets is possible
• Data structure
  – SAS/SPSS: rectangular data set
  – R: rectangular data frame, vector, list

Summary (cont.)
• Interface to other programming languages
  – SAS/SPSS: not available
  – R: can be easily mixed with Fortran, C, C++ and Java
• Source code
  – SAS/SPSS: not available
  – R: the source code of R itself, as well as of its packages, is part of the distribution

References
• Hornik, K. and Leisch, F. (2005) R Version 2.1.0, Computational Statistics, 20(2), pp. 197-202
• Kabacoff, R. (2008) Quick-R for SAS and SPSS users, available from http://www.statmethods.net/index.html
• López-de-Lacalle, J. (2006) The R-computing language: Potential for Asian economists, Journal of Asian Economics, 17(6), pp. 1066-1081
• Muenchen, R. (2007) R for SAS and SPSS users, URL: http://oit.utk.edu/scc/RforSAS&SPSSusers.pdf
• Murrell, P. (2005) R Graphics, Chapman & Hall
• R Development Core Team (2007) R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, ISBN 3-900051-07-0, URL: http://www.r-project.org/
• Templ, M. and Filzmoser, P. (2008) Visualisation of Missing Values and Robust Imputation in Environmental Surveys, submitted for publication
• Wheeler, D. A. (2007) Why Open Source Software / Free Software (OSS/FS, FLOSS, or FOSS)? Look at the Numbers!