SEEING IS BELIEVING: Telling stories with statistics – in pictures
We’re failing
Do you see the same thing here?
            Gender
Military    Male    Female
No           943     1,222
Yes          227        72
This is your brain on statistics
            Gender
Military    Male    Female
No           943     1,222
Yes          227        72
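The total sample is (roughly) evenly divided by gender. Subtracting the observed 72 from the roughly 150 one would expect in that cell gives a value of about 80, which squared is 6,400. Divided by the expected 150, that one cell alone contributes more than 40 to the chi-square. It is already obvious this is significant.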
Just for closure...
   e       o    (e-o)^2   ((e-o)^2)/e
 157      72     7225     46.01910828
 142     227     7225     50.88028169
1028     943     7225      7.028210117
1137    1222     7225      6.354441513
                 Sum     110.2820416
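For anyone who would rather let SAS do the arithmetic, here is a minimal sketch of the same chi-square test run from the four cell counts. The data set name (service) and the variable names (gender, military, count) are made up for illustration; only the counts come from the table in the slides.

```sas
/* Hypothetical data set and variable names; the counts are from the slide */
data service;
  input gender $ military $ count;
  datalines;
Male No 943
Female No 1222
Male Yes 227
Female Yes 72
;
run;

proc freq data=service;
  weight count;                             /* each row stands for 'count' people */
  tables military*gender / chisq expected;  /* Pearson chi-square plus expected cell counts */
run;
```

The EXPECTED option prints the expected counts next to the observed ones, so the e and o columns can be checked against the output. The Pearson chi-square should come out at about 110 with 1 degree of freedom, in line with the hand calculation.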
Seeing is a learned skill Statisticians may see things in a picture others don’t
My points (surprisingly, I do have some)
Data Visualization • Graphics do not necessarily stand alone
Data visualization is all around us.
Visual representation in one context is often misapplied to another.
Atomic numbers on your socks?
Data visualization needs to ADD information
Basic Assumptions • Our audience needs to be taught to read visual data just as we read numeric data, and we need to learn to have some discussion beyond the choices of line graphs vs. pie charts
You learned to read numbers YOU NEED TO LEARN TO WRITE PICTURES Or, to be more specific, you need to explain to others what you see in pictures
Question + Data > Picture = Story ?
Bad visualization for one question can be good for another
• Who will win the election?
• Which regions support the Democrats?
Poll dataset did not include Hawaii or Alaska
AN EXAMPLE OF PROGRAM EVALUATION DATA VISUALIZATION BY EXAMPLE
The government is smarter than you think (No, I’m serious)
Was the program implemented as planned?
(This was done in JMP)
Did the program work?
GOPTIONS HBY = 2 ;
PROC GPLOT DATA=wussexample UNIFORM ;
  PLOT z_total_post * z_total_pre / VREF=0 ;
  BY group ;
RUN ;
NOTE: Regression equation : z_total_post = 0.13379 + 0.776552*z_total_pre.
NOTE: The above message was for the following BY group: group=CONTROL
NOTE: Regression equation : z_total_post = 1.233616 + 0.578418*z_total_pre.
NOTE: The above message was for the following BY group: group=EXPERIMENTAL
EQUATIONS IN THE SAS LOG FOR THE STATISTICIAN IN YOU
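PROC GPLOT writes regression equations like these to the log when a SYMBOL statement asks for a regression fit. The slides do not show that statement, so the sketch below assumes one. The SYMBOL1 settings are an assumption, as is the PROC SORT step that BY-group processing normally needs.

```sas
/* Assumption: data must be sorted by the BY variable */
proc sort data=wussexample;
  by group;
run;

goptions hby=2;

/* Assumption: this SYMBOL statement is not shown in the slides.         */
/* INTERPOL=RL draws a linear regression line and logs its equation.    */
symbol1 interpol=rl value=dot;

proc gplot data=wussexample uniform;
  plot z_total_post * z_total_pre / vref=0;
  by group;
run;
quit;
```

With UNIFORM, both BY-group plots share the same axis scaling, which is what makes the two fitted lines comparable by eye.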
Is the intervention successful under all conditions?
Admittedly, we did not train people while flying on a trapeze TRAINING WAS ADMINISTERED TO FOUR COHORTS
Creating the interaction graph
First, in the RESULTS window, type sgedit on
ods listing sge = on ;
ods graphics on ;
proc glm data = plots ;
  class TestType cohort ;
  model z_total = TestType cohort TestType*cohort ;
  where group = "EXPERIMENTAL" ;
run ;
Click on the sge plot to edit it
Of course, that is kind of like being the smaller midget ODDLY, THE MOST TIME-CONSUMING PART OF THIS IS MAKING THE LINES THICKER
Using SGEDIT to, well, edit
1. Double-click on the .sge file in the RESULTS window
2. Right-click in the plot area & select PLOT PROPERTIES
3. Select desired line thickness
Yes, the TestType*Cohort*Group interaction (F = 5.84, p < .0001) AND the TestType*Group interaction (F = 22.92, p < .0001) in the other repeated measures ANOVA were significant. THANKS FOR ASKING!
LOOKING AT THE LITTLE PICTURE
Graphs sometimes provide better information than numbers (Especially true for small samples)
or… How SAS ODS GRAPHICS can improve your life
Are these tests related? r = .22
Look!
Another example
• Years of Education as predictor of gain score
• R-square = .46
• Correlation = .68
• p < .01
Now looky here … Is it a real relationship?
What should we do? Throw the score out? Keep the score in? Something else?
Ignoring my partner … Compare your answers with the people next to you
Sometimes outliers are the most interesting part of your study
ODS GRAPHICS ON;
<some procedure>
ODS GRAPHICS OFF;
PROC CORR
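As one possible instance of the ODS GRAPHICS wrapper, here is a sketch that asks PROC CORR for a scatter plot alongside the correlation. The data set (work.scores) and the variables (pretest, posttest) are hypothetical stand-ins, not names from the talk.

```sas
ods graphics on;

/* Hypothetical data set and variable names, for illustration only */
proc corr data=work.scores plots=scatter(ellipse=prediction);
  var pretest posttest;   /* one scatter plot for this pair of variables */
run;

ods graphics off;
```

The plot is where an outlier that is driving (or hiding) a correlation tends to show up, which the coefficient alone will not reveal.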
One last example on knowing your data Not just telling a story, having a conversation
PROC FREQ
Custom Map-making How to plot the largest category in a frequency distribution
1, 2, 3
1. PROC TABULATE -> output dataset
2. PROC FORMAT
3. PROC GMAP
WHERE IS DEMOCRATIC SUPPORT BASED? DATA VISUALIZATION IN POLITICAL SURVEYS DATA VISUALIZATION BY EXAMPLE
PROC TABULATE DATA = in.VOTE2008 OUT = Summary.VOTE2008 ;
  CLASS question3 state ;
  TABLE state, question3*RowPctN ;
RUN ;
WARNING: Some observations were discarded when charting PctN_01. Only first matching observation was used. Use STATISTIC= option for summary statistics.
proc format ;
  value vote 50.01 - 100 = "Obama"
             0 - 50 = "McCain" ;
run ;
PROC GMAP DATA = Summary.VOTE2008 map = maps.us ;
  ID state ;
  CHORO PctN_01 / discrete LEGEND=LEGEND1 ;
RUN ;
The ID statement uses the _map_geometry_ variable that was merged in from the maps.us dataset to identify the location on the map.
PROC GMAP DATA = Summary.VOTE2008 map = maps.us ;
  ID state ;
  CHORO PctN_01 / discrete LEGEND=LEGEND1 ;
  pattern1 c = red ;
  pattern2 c = blue ;
  format PctN_01 vote. ;
RUN ;
PROC GMAP
CHORO PctN_01 / discrete LEGEND=LEGEND1 ;
FORMAT PctN_01 vote. ;
The CHORO statement uses the first observation and ignores the others.
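Because CHORO keeps only the first matching observation per state, one alternative is to hand it a data set that already has one row per state. This is a sketch of that idea, not the approach used in the talk. It assumes question3 is numeric and that the value 1 marks the Obama response, neither of which the slides confirm. The data set name work.obama_pct is also made up.

```sas
/* Same format as in the slides, repeated so the block stands alone */
proc format;
  value vote 50.01 - 100 = "Obama"
             0 - 50 = "McCain";
run;

/* Assumption: question3 = 1 is the Obama row of the PROC TABULATE output */
data work.obama_pct;
  set Summary.VOTE2008;
  where question3 = 1;
run;

proc gmap data=work.obama_pct map=maps.us;
  id state;
  choro PctN_01 / discrete;
  format PctN_01 vote.;
run;
quit;
```

With one row per state there is nothing for GMAP to discard, so the warning about discarded observations should no longer appear.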
Does Race Matter?
PROC GMAP
vote2008 coded 0 = McCain, 1 = Obama
pctmin = Percentage of residents in voter’s district from minority groups
PROC GMAP DATA = wuss map = maps.us ;
  ID state ;
  area vote2008 / discrete statistic = mean ;
  block pctmin / discrete statistic = mean ;
  format pctmin rangep. vote2008 voten. ;
RUN ;
The BLOCK statement charts the pctmin variable. The height of the block will be based on the value of the variable, but the color will be determined using the format specified.
Mean minority percentage in districts where Obama voters live is 21% versus 13% for McCain voters (t = 5.73, p < .0001)
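A comparison like this is the kind of result a two-sample t-test produces. Here is a minimal sketch, assuming the test was run on the same wuss data set with the vote2008 and pctmin variables described earlier. The slides do not show the actual statements.

```sas
/* Assumption: same data set and variables as the block map */
proc ttest data=wuss;
  class vote2008;   /* 0 = McCain, 1 = Obama */
  var pctmin;       /* percentage of minority residents in the voter's district */
run;
```

The output reports the mean of pctmin for each group along with pooled and Satterthwaite t statistics, so the map and the test can be read side by side.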
The usefulness of visual data
With one statement, I can change the percentage of minority & re-run the chart
value rangep 0 - 15 = "0-15%" 15.01 - 100 = "> 15%%" ;
Decision Trees, ROC & Lift Curves to Predict Military Service DATA VISUALIZATION BY EXAMPLE
Speaking of easy, interactive, graphics JMP
libname readin "E: crimesreadout" ;
libname writeout xport "ewuss 2010crimes.xpt" ;
proc copy in = readin out = writeout ;
How to get a SAS .xpt file into JMP, Step 1: File > Open
DECISION TREE
• ANALYZE > MODELING > PARTITION
• SELECT Y
• SELECT X VARIABLES
• Click on the SPLIT button
Receiver Operating Characteristic
Click on the red arrow at the top left of the partition window for pull-down options, which include ROC and Lift curves.
ROC
• Sensitivity is the true positive rate: for example, the percentage of people who actually died whom you predicted would die.
• Specificity is the true negative rate: for example, the percentage of people who survived whom you predicted would NOT die.
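In terms of the four cells of a prediction-by-outcome table, the two quantities are conventionally written as below. Here TP, FN, TN and FP stand for the counts of true positives, false negatives, true negatives and false positives. These symbols are standard shorthand and do not appear in the slides.

```latex
\[
\text{Sensitivity} = \frac{TP}{TP + FN},
\qquad
\text{Specificity} = \frac{TN}{TN + FP}
\]
```

An ROC curve plots sensitivity against 1 minus specificity as the cutoff for predicting "yes" is varied.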
Comparing models
In JMP, use of training and testing datasets is REALLY easy EXCLUDE 25% or 50% of the data and then re-run your analyses with the excluded sample
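The talk does this step in JMP. For readers working in SAS, a rough equivalent is sketched below. The data set names (work.crimes, work.crimes_split, work.train, work.test) are hypothetical and the seed is arbitrary.

```sas
/* Flag a simple random 25% of the rows; OUTALL keeps every row and adds Selected */
proc surveyselect data=work.crimes out=work.crimes_split
                  method=srs samprate=0.25 seed=2010 outall;
run;

/* Split on the Selected flag: 1 = held-out test rows, 0 = training rows */
data work.train work.test;
  set work.crimes_split;
  if selected = 1 then output work.test;
  else output work.train;
run;
```

Fit the model on work.train, then score work.test to see whether the accuracy holds up on data the model has not seen.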
A statistician is a person who was good at math but didn’t have enough personality to be an accountant ?
It is important that people believe you And that’s my story
AnnMaria De Mars
The Julia Group
2111 7th St #8
Santa Monica, CA 90405
ANNMARIA@THEJULIAGROUP.COM
(310) 717-9089