STATA Boot Camp Day 3 Advanced Data Manipulation
STATA Boot Camp Day 3: Advanced Data Manipulation + Summaries
Nate Mercaldo, Sarah Fletcher Mercaldo, Andrew Wiese
Objectives
• Advanced Data Management/Manipulation
  – missing data
  – operators
  – a few cool functions to create new variables
• Multivariate Statistical Summaries
  – generate multi-way tables (counts, means, other summaries)
  – generate multivariate figures (+ modifying figure aesthetics)
• Reproducible Research (do files)
Setup
• Start log file!
• Load birth weight data (birthweight_v2.dta, or birthweight_v2_11.dta if not using Stata 14)
• Variables
  – bwt, low = child’s birth weight, indicator of low birth weight
  – age, race, smoke, height, weight = mother’s age, race, smoking status, height and weight
Exercise 1
• Generate bmi = weight (lb) / height (in)^2 * 703
• Summarize bmi: min, max, mean (sd), median (IQR)
• Create a histogram of bmi
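One possible way to work through Exercise 1 is sketched below. To stay self-contained it uses a tiny made-up dataset (the four height/weight pairs are assumptions, not rows from the course file); with the course data loaded, only the last three commands would be needed.

```stata
* Toy data: made-up heights (in) and weights (lb), NOT the course file
clear
input height weight
64 130
66 150
61 106
70 180
end

* BMI = weight (lb) / height (in)^2 * 703
generate bmi = weight / height^2 * 703

* min, max, mean, sd, median and IQR in one compact table
tabstat bmi, statistics(min max mean sd median iqr)

* Histogram of bmi
histogram bmi
```

tabstat is used here because it reports all six requested statistics in a single row; summarize bmi, detail would give the same information in a longer layout.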
Exercise 1

    variable |       min        max       mean         sd        p50        iqr
    ---------+-----------------------------------------------------------------
         bmi |  .0007453   1716.137   106.1197   360.3435   22.39452   3.201514
Huh?
• Anything odd about the generated summaries?
• Typically we receive data sets that have NOT been “cleaned”.

    . summarize height weight bmi

    Variable |  Obs       Mean   Std. Dev.        Min        Max
    ---------+--------------------------------------------------
      height |  189   590.7354   2229.642          61       9999
      weight |  189   656.9788   2213.965         106       9999
         bmi |  189   106.1197   360.3435    .0007453   1716.137
Missing Values
• Stata codes missing values as a period (.)
• Researchers code missing values as all sorts of things (e.g., 99, -9, 9, 99, NA, ?)
• Guesses on how missing values were coded in the lbw file?
• How can we replace these values with . ?
Aside: Operators
• Arithmetic: + (addition), - (subtraction), * (multiplication), / (division), ^ (power)
• Relational: > (greater than), < (less than), >= (greater than or equal), <= (less than or equal), == (equal), != (not equal)
• Logical: & (and), | (or; the pipe character, above the Enter key), ! (not)
Missing Values
• How can we replace these values with . ?
• Use logical operators (or logical expressions)!
  – Create new variables that keep only the valid values:
      generate height2 = height if height < 9999
      generate weight2 = weight if weight < 9999
  – Or overwrite the original variables in place:
      replace height = . if height == 9999
      replace weight = . if weight == 9999
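As an alternative to one replace per variable, Stata's mvdecode command recodes a numeric placeholder to missing in several variables at once. The sketch below assumes, as the slide does, that 9999 is the placeholder; the three rows are made up for illustration.

```stata
* Toy data with the 9999 placeholder (made-up rows)
clear
input height weight
64   130
9999 150
61   9999
end

* Recode 9999 to Stata's system missing value (.) in both variables
mvdecode height weight, mv(9999)

* Confirm how many values are now missing in each variable
count if missing(height)
count if missing(weight)
```

mvdecode prints how many values it changed per variable, which doubles as a quick check that the placeholder was what you expected.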
Missing Values
• Recompute bmi and summarize

    variable |       min        max       mean         sd        p50        iqr
    ---------+-----------------------------------------------------------------
         bmi |  16.87565   28.52808   22.50763   2.037052   22.39452   2.842493
Other uses of relational / logical operators
• Restrict data to a specific group
  – list if age <= 15 | age >= 45 (list “at risk” mothers)
  – drop if smoke == 1 (remove smokers, BEWARE!)
• Generating new variables
  – See Exercise 2
• Compute summaries for a specific group
  – summarize bmi if smoke == 0
• Used A LOT when creating custom tables/figures
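One detail worth keeping in mind with relational operators: Stata treats a missing value (.) as larger than any number, so a condition such as age >= 45 is also true when age is missing. The sketch below illustrates this on four made-up rows.

```stata
* Toy data: one mother has a missing age (made-up rows)
clear
input age smoke
14 0
30 1
47 0
.  1
end

* Missing age counts as "large", so this also picks up the row with age == .
count if age <= 15 | age >= 45

* Excluding missing values explicitly
count if (age <= 15 | age >= 45) & !missing(age)
```

The first count returns 3 and the second returns 2; the difference is the observation with missing age.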
Exercise 2
• Categorize BMI: generate a new variable called overweight
  – overweight = 1 if 25 <= bmi <= 30, 0 otherwise (note: max bmi = 29 in this dataset)
• Summarize birth weight by overweight

    overweight |   min    max       mean         sd    p50      iqr
    -----------+---------------------------------------------------
             0 |   709   4990   2964.701   698.1659   2977   1035.5
             1 |  1330   3941     2903.8   882.4416   2977     1723
    -----------+---------------------------------------------------
         Total |   709   4990   2959.598   712.6656   2977     1063
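A possible solution sketch for Exercise 2 follows, on made-up rows. Note that the mathematical shorthand 25 <= bmi <= 30 cannot be typed as is: Stata evaluates it left to right as (25 <= bmi) <= 30, which is always true, so the two comparisons are written separately and joined with &.

```stata
* Toy data: made-up bmi and birth weight values, one bmi missing
clear
input bmi bwt
22.4 3100
26.1 2900
17.9 2500
.    3300
end

* 1 if 25 <= bmi <= 30, 0 otherwise;
* "if !missing(bmi)" leaves overweight missing when bmi is missing
generate overweight = (bmi >= 25 & bmi <= 30) if !missing(bmi)

* Check the new variable, then summarize birth weight by group
tabulate overweight, missing
tabstat bwt, statistics(min max mean sd median iqr) by(overweight)
```

Without the if !missing(bmi) qualifier, the observation with missing bmi would be coded 0 rather than missing.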
Functions to Generate New Variables
• Data > Create or change data > Create new variable (extended)
• Categorize continuous variables
    egen bmicat = cut(bmi), at(0, 18.5, 25, 30, 100) icodes
    (check: table bmicat)
• Group variables
    egen racesmoke = group(race smoke)
    (check: table racesmoke race smoke)
• Create indicator/dummy variables
    quietly tabulate bmicat, generate(bmicat_)
    (check: table bmicat)
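The three commands above can be tried on a small example to see what each one creates. The five rows below are made up; the commands themselves are the ones from the slide.

```stata
* Toy data (made-up rows)
clear
input bmi race smoke
17.0 1 0
22.0 1 1
27.0 2 0
31.0 3 1
24.0 2 1
end

* icodes labels the intervals 0, 1, 2, 3 instead of using the left cut points
egen bmicat = cut(bmi), at(0, 18.5, 25, 30, 100) icodes

* One integer code per observed race/smoke combination
egen racesmoke = group(race smoke)

* One 0/1 indicator per level of bmicat: bmicat_1, bmicat_2, ...
quietly tabulate bmicat, generate(bmicat_)

list bmi bmicat bmicat_* race smoke racesmoke
```

Because only five of the six race/smoke combinations occur in the toy data, racesmoke runs from 1 to 5 here; group() numbers only the combinations that are present.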
Multivariate Summaries
• Day 2: we looked at a lot of univariate (or marginal) summaries
• Generally we are more interested in multivariate summaries, say identifying factors associated with low birth weight infants.
• Using operators to compute summaries (by hand) can be tedious; it would be helpful to have Stata do all the heavy lifting (e.g., the cut() function of egen).
Multivariate Tabular Summaries
• Possible factors associated with low birth weight infants
  – age, smoke, bmi (bmicat)
• How can we summarize these variables by low?
  – Continuous: age, bmi [range, mean, sd, quantiles]
  – Categorical: smoking status and bmicat (frequencies/proportions)
• Statistics > Summaries, tables, and tests >
  – Summary and descriptive statistics
  – Other tables > Compact table of summary statistics
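For the categorical case, frequencies and proportions by low can be obtained with tabulate and its column option. The sketch below rebuilds the smoke-by-low counts reported in this deck's two-way table (86, 29, 44, 30) as a small frequency-weighted dataset; the variable n holding the counts is an assumption made for this illustration. With the course data (one row per mother), the weight would be dropped.

```stata
* Cell counts entered as a frequency-weighted dataset
clear
input smoke low n
0 0 86
0 1 29
1 0 44
1 1 30
end

* Counts plus column percentages: the share of smokers within each level of low
tabulate smoke low [fweight = n], column

* With one row per mother this would simply be:
*   tabulate smoke low, column
```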
Multivariate Tabular Summaries
• Compact table of summary stats, Options (wide table)

    tabstat age smoke bmi, statistics(mean sd median iqr) by(low) longstub

    low    stats |       age      smoke        bmi
    -------------+--------------------------------
    0       mean |  23.66154   .3384615   22.49244
              sd |  5.584522   .4750169   2.044822
             p50 |        23          0   22.29633
             iqr |         9          1   2.746094
    -------------+--------------------------------
    1       mean |  22.30508   .5084746   22.54381
              sd |  4.511496   .5042195   2.038627
             p50 |        22          1   22.48364
             iqr |         6          1   2.973585
    -------------+--------------------------------
Multivariate Tabular Summaries
• Suppose we are interested in testing whether there is an association between smoking and low birth weight
  – Statistics > Summaries, tables and tests > Frequency tables > Two-way tables with measures of association

    . tabulate smoke low, chi2

          |       low
    smoke |     0      1 |  Total
    ------+--------------+-------
        0 |    86     29 |    115
        1 |    44     30 |     74
    ------+--------------+-------
    Total |   130     59 |    189

    Pearson chi2(1) = 4.9237   Pr = 0.026

  – Statistics > Summaries, tables and tests > Classical tests of hypotheses
Exercise 3
• Compute the associations (tables and χ²) between smoking and low birth weight by race (hint: command from Day 1?)

    -> race = 1

          |       low
    smoke |     0      1 |  Total
    ------+--------------+-------
        0 |    40      4 |     44
        1 |    33     19 |     52
    ------+--------------+-------
    Total |    73     23 |     96

    Pearson chi2(1) = 9.8556   Pr = 0.002
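One way to approach Exercise 3 is the by prefix, which repeats a command within each level of a variable and produces the "-> race = 1" style headers shown above. In the sketch below the race = 1 counts are the ones on the slide; the second group is an assumption made for illustration, pooling all remaining mothers (whole-sample counts minus the race = 1 counts) under a hypothetical code 2, and n is a hypothetical count variable.

```stata
* Cell counts as a frequency-weighted dataset
* race == 1: counts from the slide; race == 2: all other mothers pooled (assumption)
clear
input race smoke low n
1 0 0 40
1 0 1 4
1 1 0 33
1 1 1 19
2 0 0 46
2 0 1 25
2 1 0 11
2 1 1 11
end

* One smoke-by-low table and chi-squared test per level of race
bysort race: tabulate smoke low [fweight = n], chi2

* With the course data (one row per mother):
*   bysort race: tabulate smoke low, chi2
```

bysort sorts by race first and then applies by, so no separate sort command is needed.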
Multivariate Graphics
[Figure: box plot and scatterplot examples, each shown with default and customized styling]
www.ats.ucla.edu/Stat/stata/library/GraphExamples/default.htm
Examples
Stata can make lots of plots, but that does not mean you should!
http://www.surveydesign.com.au/tipsgraphs.html
Multivariate Plots
• Type of plot depends on the TYPES of variables
  – Categorical/Categorical
    • Tables
  – Categorical/Continuous
    • Box plots, histograms
  – Continuous/Continuous
    • Scatter/bubble plots
Multivariate Plots: Categorical / Continuous
– Box Plots
  • Graphics > Box plot > main variable = continuous, Categories Tab > Group 1 = categorical
  • graph box bwt, over(smoke)
– Histograms
  • Graphics > Histogram > main variable = continuous, By Tab > categorical
  • histogram bwt, frequency bin(10) by(smoke)
Multivariate Plots: Continuous / Continuous
– Scatter plots (bubble plots are scatter plots with varying point sizes)
  • Graphics > Twoway graph > Create > Basic Plots > Scatter; Y variable = continuous, X variable = continuous
  • twoway scatter bwt age, sort
– Other add-ons: lowess smoothers
  • Graphics > Twoway graph > Create > Advanced Plots > Lowess Line; Y variable = continuous, X variable = continuous
  • twoway (scatter bwt age, sort) (lowess bwt age)
Exercise 4
• Summarize birth weight by smoking status and race
  – Create a boxplot of birth weight by smoking status
  – Create a boxplot of birth weight by race
  – Create a boxplot of birth weight by smoking status AND race
• Summarize maternal age and birth weight (as a group)
  – Create a scatter plot of age by birth weight
  – Add smoothers by smoking status (red: smoke=1, black: smoke=0)
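A possible sketch for Exercise 4 is given below. So that it runs on its own, it simulates 90 observations with the same variable names; the simulated values (and the seed) are assumptions and carry no information about the real data. The /// line continuations require running the code from a do file.

```stata
* Simulated stand-in data (NOT the course file)
clear
set seed 12345
set obs 90
generate race  = 1 + mod(_n, 3)
generate smoke = mod(_n, 2)
generate age   = 15 + floor(30 * runiform())
generate bwt   = 3000 - 250 * smoke + rnormal(0, 500)

* Box plots: by smoking status, by race, and by both
graph box bwt, over(smoke)
graph box bwt, over(race)
graph box bwt, over(smoke) over(race)

* Scatter plot with one lowess smoother per smoking group
twoway (scatter bwt age)                                ///
       (lowess bwt age if smoke == 1, lcolor(red))      ///
       (lowess bwt age if smoke == 0, lcolor(black)),   ///
       legend(order(2 "Smoker" 3 "Non-Smoker"))
```

With two over() options, the first variable listed (smoke) varies within each category of the second (race).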
Can we improve the aesthetics of these plots?
Improving Aesthetics of a Plot
• Plots are comprised of points/symbols, lines, text, labels, legends, …
• Stata defaults are fine for preliminary analyses or for homework, but modifications are needed for publications (or to reflect personal style)
• The following slides provide examples of how to:
  – Add/modify text: titles, x-/y-axes, legends, …
  – Modify plotting symbols: color, size, symbol, …
  – Modify plotting lines: color, width, type, …
  – Modify colors: histograms, box plots, …
Modifying Aesthetics: Text • Birth weight by maternal race and smoking status
Modifying Aesthetics: Text - CODE
• Birth weight by maternal race and smoking status

    graph box bwt, over(racesmoke)

    label define rslab 1 "White: NS" 2 "White: S" 3 "Black: NS" 4 "Black: S" 5 "Other: NS" 6 "Other: S"
    label values racesmoke rslab
    graph box bwt, over(racesmoke) ytitle("Birth Weight") title("Infant Birth Weight by Maternal Race and Smoking Status") subtitle("subtitle") caption("caption") note("note")
Modifying Aesthetics: Symbols
• Birth weight by maternal age and smoking status
http://www.stata.com/manuals13/g-3marker_options.pdf
Modifying Aesthetics: Symbols - CODE
• Birth weight by maternal age and smoking status

    twoway scatter bwt age

    twoway (scatter bwt age if smoke==0, mcolor(black) msize(small) msymbol(diamond)) (scatter bwt age if smoke==1, mcolor(red) msize(large)), legend(order(1 "Non-Smoker" 2 "Smoker"))

Note: it is NOT recommended to use all options simultaneously!
Modifying Aesthetics: Lines
• Birth weight by maternal age and smoking status
http://www.stata.com/manuals13/g-3line_options.pdf
Modifying Aesthetics: Lines - CODE
• Birth weight by maternal age and smoking status

    twoway (scatter bwt age) (lowess bwt age)

    twoway (scatter bwt age) (lowess bwt age if smoke==0, lcolor(black) lwidth(thin) lpattern(dash)) (lowess bwt age if smoke==1, lcolor(red) lwidth(thick))

Note: it is NOT recommended to use all options simultaneously!
Modifying Aesthetics: Colors • Birth weight by maternal age and smoking status
Modifying Aesthetics: Colors - CODE
• Birth weight by maternal age and smoking status

    histogram bwt, bin(35) frequency fcolor(sandb) lcolor(lavender) lwidth(thick)

    label define slab 1 "Smoker" 0 "Non-Smoker"
    label values smoke slab
    graph box bwt, over(smoke) box(1, fcolor(chocolate) lcolor(pink))
    graph box bwt, over(smoke) scheme(s2mono)
Reproducible Research
• Do file
  – What is a do file? A file that contains all code (with comments)
  – Benefits of a do file?
      Record of all data manipulations
      Record of everything you do to generate an analysis (summary, figure)
  – How do do files differ from log files?
• What if I told you that height was in cm and not inches? How long would it take you to redo all the analysis from today?
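To make the cm-versus-inches question concrete, here is a minimal sketch of what a do file for today's BMI steps might look like. The file and log names, the height_in_cm switch, and the toy rows are all assumptions made for illustration; in practice the input block would be replaced by loading the course data.

```stata
* day3_analysis.do  (hypothetical file name)
clear all
capture log close
log using "day3_analysis.log", replace text

* With the course data this block would be:  use "birthweight_v2.dta", clear
input height weight
64   130
9999 150
61   9999
70   180
end

* Recode the 9999 placeholders to missing
mvdecode height weight, mv(9999)

* Single switch: set to 1 if height turns out to be recorded in cm
local height_in_cm = 0
if `height_in_cm' {
    replace height = height / 2.54    // convert cm to inches
}

* Everything downstream is regenerated by rerunning the file
generate bmi = weight / height^2 * 703
tabstat bmi, statistics(min max mean sd median iqr)

log close
```

With the analysis in a do file, redoing the day's work after such a correction comes down to changing one line and rerunning the file, whereas a log file only records what was already run.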