Exploratory Data Analysis Hal Varian 20 March 2006
What is EDA?
- Goals
  - Examine and summarize data
  - Look for patterns and suggest hypotheses
  - Provide guidance for more systematic analysis
- Methods of analysis
  - Primarily graphics and tables
- Online reference
  - http://www.itl.nist.gov/div898/handbook/eda.htm
  - http://www.math.yorku.ca/SCS/Courses/eda/
Tools for EDA
- We will use R = open source S
  - Very widely used by statisticians
  - Libraries for all sorts of things are available
  - Download from
    - cran.stat.ucla.edu
    - http://www.r-project.org/
- Recommend ESS (= Emacs Speaks Statistics) for interactive use
- Windows interface is not bad
Interactive R session
    > library("foreign")
    > dat <- read.spss("GSS93 subset.sav")
    > attach(dat)
    > summary(AGE)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
       18.0    33.0    43.0    46.4    59.0    99.0
    > hist(AGE)
Histogram of age
Recode missing data
    AGE[AGE>90] <- NA
    plot(density(AGE, na.rm=T))
    # plot both together
    hist(AGE, freq=F)
    lines(density(AGE, na.rm=T))
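The recoding step can be tried without the GSS file. The sketch below uses simulated ages, and the value 99 stands in for a missing-value code; both are assumptions made for illustration, since the slide only shows that values above 90 are set to NA.

```r
# Simulated stand-in for the GSS AGE variable (hypothetical data, not the survey).
set.seed(1)
age <- c(round(runif(200, min = 18, max = 89)), 99, 99, 99)  # 99 plays the missing-value code

# Recode implausible values to NA, as on the slide.
age[age > 90] <- NA

# Summaries must now skip the recoded values explicitly.
n_missing <- sum(is.na(age))
age_max <- max(age, na.rm = TRUE)

# Histogram on the density scale with a kernel density estimate drawn on top.
hist(age, freq = FALSE, main = "Histogram and density of age")
lines(density(age, na.rm = TRUE))
```

Using freq = FALSE puts the histogram on the same vertical scale as the density estimate, which is what lets the two be overlaid.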
Density and density + hist
Boxplot (figure labels, top to bottom)
- Outlier
- 1.5 x interquartile range
- 3rd quartile
- Median
- 1st quartile
- Smallest value
Boxplot enhancements
- Notches: confidence interval for the median
- varwidth=T: width of box is proportional to sqrt(n)
- Useful for comparisons
Comparing distributions
    boxplot(AGE~RACE)
    boxplot(AGE~RACE, notch=T, varwidth=T)
- There doesn't seem to be a big difference in the age distributions
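A small sketch of what notch and varwidth add, using made-up groups of unequal size rather than the GSS data (the group labels and values here are assumptions). boxplot() returns the numbers it draws, so the notch limits can be inspected directly.

```r
set.seed(2)
# Hypothetical data: three groups of unequal size drawn from the same distribution.
grp <- factor(rep(c("a", "b", "c"), times = c(300, 60, 30)))
val <- rnorm(length(grp), mean = 45, sd = 15)

# notch = TRUE draws an approximate confidence interval around each median;
# varwidth = TRUE makes box widths proportional to sqrt(group size).
bp <- boxplot(val ~ grp, notch = TRUE, varwidth = TRUE)

bp$n      # group sizes, which drive the box widths
bp$conf   # row 1: lower notch limits, row 2: upper notch limits
```

When the notches of two boxes overlap, as they should here, the plot gives no strong reason to think the medians differ.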
EDUC v RACE
    boxplot(EDUC[EDUC<90]~RACE[EDUC<90], notch=T, varwidth=T)
Violin plot
- Combines density plot and boxplot
- Good for oddly shaped distributions
Back to Back Histogram
    library("Hmisc")
    histbackback(EDUC[RACE=="black"], EDUC[RACE=="white"], probability=T)
Two-way table
    GT12 <- EDUC>12
    temp <- table(GT12, RACE)

    GT12    white black other
      FALSE   614   100    37
      TRUE    640    67    38

    prop.table(temp, 2)

    GT12        white     black     other
      FALSE 0.4896332 0.5988024 0.4933333
      TRUE  0.5103668 0.4011976 0.5066667
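The column proportions can be checked by rebuilding the table from the counts shown on the slide. This is only a sketch: the counts are typed in by hand rather than computed from the GSS file, and the object names are chosen for illustration.

```r
# Counts copied from the two-way table on the slide (filled column by column).
counts <- matrix(c(614, 640, 100, 67, 37, 38), nrow = 2,
                 dimnames = list(GT12 = c("FALSE", "TRUE"),
                                 RACE = c("white", "black", "other")))
tab <- as.table(counts)

# margin = 2 divides each cell by its column total, so every column sums to 1.
col_props <- prop.table(tab, margin = 2)
round(col_props, 3)
```

With margin = 1 the same call would give row proportions instead, which answers a different question (the racial mix within each education group).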
Comparing distributions
- qqplot = quantile-quantile plot
  - Fraction of data less than k in x
  - Fraction of data less than k in y
- Shapes
  - Straight line: same distribution
  - Vertical intercepts differ: different mean
  - Slopes differ: different variance
- Reference distribution can be a theoretical distribution
  - qqnorm: compare to standardized normal
  - Skew to right: both tails below straight line
  - Heavy tails: lower tail above, upper tail below line
  - (These directions assume the data quantiles are on the horizontal axis; R's qqnorm puts the sample on the vertical axis by default, which reverses them)
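The intercept and slope readings can be tried on simulated samples. The sketch below uses made-up normal draws (the sample sizes, means, and standard deviations are assumptions) and checks the slope numerically as well as visually.

```r
set.seed(3)
x <- rnorm(500)                    # reference sample, N(0, 1)
y_shift <- rnorm(500, mean = 2)    # same spread, different mean
y_scale <- rnorm(500, sd = 2)      # same mean, different spread

op <- par(mfrow = c(1, 3))
qqplot(x, y_shift, main = "Means differ"); abline(0, 1)
qqplot(x, y_scale, main = "Variances differ"); abline(0, 1)
qqnorm(y_scale, main = "Sample v normal"); qqline(y_scale)
par(op)

# qqplot() also returns the matched quantiles, so the slope can be estimated.
qq <- qqplot(x, y_scale, plot.it = FALSE)
fit <- coef(lm(qq$y ~ qq$x))   # slope should be close to 2, the ratio of the sds
```

In the first panel the points run parallel to the 45-degree line but shifted up; in the second they pass through the origin with a steeper slope.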
qqplot(x, y) examples (figure panels)
- Identical
- Mean1=0, Mean2=2
- s1=1, s2=2
- Sample v N(0,1), with ref line
More qqnorm examples (figure panels)
- Skewed to right
- Heavy tails
- Source: www.maths.murdoch.edu.au/units/statsnotes/samplestats/qqplot.html
Pairs of variables
- Is one variable related to another?
- Scatterplot
  - Basic: plot(x, y)
  - Enhanced from library("car"): scatterplot(x, y)
- Scatterplot matrix
  - Basic: pairs(data.frame(x, y, z))
  - Enhanced: scatterplot.matrix(data.frame(x, y, z))
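The basic versions need only base R. A minimal sketch with simulated variables (x, y, and z here are made up, with y built to depend on x and z unrelated to both):

```r
set.seed(5)
x <- rnorm(100)
y <- 2 * x + rnorm(100)   # y depends on x
z <- runif(100)           # z is unrelated to x and y
d <- data.frame(x, y, z)

plot(d$x, d$y)   # single scatterplot
pairs(d)         # one panel for every pair of columns

# The correlation matrix gives a numeric companion to the scatterplot matrix.
r <- cor(d)
round(r, 2)
```

The x-y panels show a clear upward band, while the panels involving z show no pattern.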
Basic and enhanced scatterplot
Scatterplot matrix
Labeling points in scatterplots
- identify(x, y, labels="foo")
- Color is also useful
Cigarettes and taxes
- I was the discussant on a paper by Austan Goolsbee, "Playing with Fire"
- Question: did Internet purchases of cigarettes affect state tobacco tax revenues?
Cigarette Prices in 1990 s
Internet usage
Price elasticity of use/sales
- Across all states and years
  - Taxable sales elasticity: -0.802
  - Use elasticity: -0.440
- Taxable sales are much more responsive to price than usage is, which suggests some cross-border trade (aka "buttlegging")
Use vs Sales in 2000
Reduced form
- dp = log(p2001) - log(p1995)
- dq = log(q2001) - log(q1995)
- Regress dq/dp on internet penetration in 2000
- See next slide for result
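The mechanics of this regression can be sketched on simulated data. Everything below is hypothetical: the prices, quantities, penetration rates, and the built-in relationship are invented for illustration and are not the paper's data or results.

```r
set.seed(4)
n <- 50
# Hypothetical state-level data; none of these numbers come from the paper.
p1995 <- runif(n, 1.5, 2.5)
p2001 <- p1995 * runif(n, 1.3, 2.0)
q1995 <- runif(n, 50, 150)
internet <- runif(n, 0.3, 0.6)        # assumed share of households online in 2000

# Simulated response: sales are more price-elastic where penetration is higher.
elasticity <- -0.3 - 1.0 * internet
q2001 <- q1995 * exp(elasticity * (log(p2001) - log(p1995)) + rnorm(n, sd = 0.02))

# Log changes as defined on the slide, then the reduced-form regression.
dp <- log(p2001) - log(p1995)
dq <- log(q2001) - log(q1995)
fit <- lm(I(dq / dp) ~ internet)
slope <- coef(fit)[["internet"]]
summary(fit)
```

The ratio dq/dp is each state's implied elasticity over the period, so a negative slope means sales responded more strongly to price where more households were online.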
What is the Internet providing?
- It was always a good deal for some (in high-tax states) to buy cigarettes out of state
- Mail order has been around for a long time and is certainly cost-effective
- Internet makes it easier to find merchants: just type into a search engine
- Internet is great at matching buyers and sellers
Price of a match
- Google doesn't accept cigarette advertisements, but Overture does
- Price for top listing: $1.20 per click
  - Avg price for click on Overture is 40 cents
  - Conversion rates might be 5%, so advertiser is paying $24 for introduction
  - But think of lifetime value…
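The $24 figure follows from dividing the price per click by the conversion rate. A short sketch of that arithmetic, using the slide's numbers (the variable names are chosen for illustration):

```r
cost_per_click <- 1.20     # top listing price from the slide, in dollars
conversion_rate <- 0.05    # the slide's guess: 1 click in 20 becomes a customer

# On average it takes 1 / conversion_rate clicks to gain one customer.
clicks_per_customer <- 1 / conversion_rate
cost_per_customer <- cost_per_click * clicks_per_customer
cost_per_customer
```

The result is sensitive to the conversion rate, which the slide itself presents as a guess.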
Value of a match
- Google doesn't accept cigarette advertisements, but Overture does
- Price for top listing: $1.20 per click
  - Avg price for click on Overture is 40 cents
  - Conversion rates might be 5%, so advertiser is paying $24 for introduction
  - But think of lifetime value…
Straightening out and scaling data
- Find a transform so that the data looks linear, or normal, or fits on the same scale
  - Log10 (easier to interpret than natural log)
  - Square root
  - Reciprocal
  - Box-Cox transform (x^r - 1)/r, which combines many of the above; the limit as r goes to 0 is log(x)
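A minimal sketch of the Box-Cox family as a function, to show how the square root, reciprocal, and log cases fall out of one formula. The function name box_cox and the test values are assumptions made for illustration.

```r
# Box-Cox transform; r = 0 is handled as the limiting case log(x).
box_cox <- function(x, r) {
  stopifnot(all(x > 0))   # the transform is defined for positive data only
  if (r == 0) log(x) else (x^r - 1) / r
}

x <- c(1, 10, 100, 1000)
bc_sqrt  <- box_cox(x, 0.5)    # equals 2 * (sqrt(x) - 1)
bc_recip <- box_cox(x, -1)     # equals 1 - 1/x
bc_log   <- box_cox(x, 0)      # natural log

# For r close to 0 the general formula approaches log(x).
gap <- max(abs(box_cox(x, 1e-6) - log(x)))
gap
```

Each member of the family differs from the plain transform only by a shift and a rescaling, so plots of the transformed data have the same shape as plots of sqrt(x), 1/x, or log(x).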
City sizes: regular & log 10