Multivariate statistics in Lecture 2 Working in RStudio

Multivariate statistics in Lecture 2 Working in RStudio Data Visualization

Plans for today
1. Quick practical announcements
2. Check homework assignments
3. Working in RStudio tutorial
4. Data analysis workflow
5. Data visualization and exploration

Quick comments about the syllabus
● Exact dates are now in the syllabus
● The material got a bit slimmer
● The midterm exam will be on Nov 17th, usual time and place
● If you have questions or need additional consultation, let me know. I also have a weekly consultation on Tuesday mornings (10:30-12)
● There is an optional class on the 22nd of December that we can use entirely for consultation
● The final exam date will be decided later, but there will be options for January

Working in RStudio

RStudio panes: Script; Console; Plots, environment; Help, history, files

How to run a command
● Script area: type the command(s), and hit Ctrl+Enter
● You can also select several lines of code to run
● You can also click on the run icon, or run through the menu (Code)
● Console: just type the command, and hit Enter. All textual results will appear here

Your turn (10 min)
● Create a new script
● Create a vector (x) of numbers from 127 to 987
● Calculate the mean of these numbers and store it in a variable (x_mean)
● Subtract the mean from each element of x and print the result
● Run all lines simultaneously
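
One possible solution to the exercise above, as a minimal sketch. The names x and x_mean come from the task; x_centered is a name introduced here only for illustration. To run all lines at once, select them in the script area and hit Ctrl+Enter.

```r
# Vector of the whole numbers from 127 to 987
x <- 127:987

# Mean of the vector, stored in a variable
x_mean <- mean(x)

# Subtraction is vectorized: the mean is subtracted from every element of x
# (x_centered is a name made up for this sketch)
x_centered <- x - x_mean

# Print the result to the console
print(x_centered)
```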

Making sense of data

Layers of data analysis
Aim: Drawing conclusions from your data
3 + 1 levels of data analysis:
1. Exploration: descriptives
2. Model (hypothesis) testing: confirmatory analysis
3. Interpretation/conclusion: explanation based on domain specific knowledge
4. (Recommendation): prediction

Data analysis workflow

Data visualization (Data viz)

Data visualization can help you to...
● … understand your data better
● … notice errors in the data
● … get new ideas
● … help your audience/readers see patterns
BUT
● It should tell a story and not complicate it further
● It should not bias your judgement with tricks

What’s the problem with this plot?

Showing how confident our result is by adding standard error

We can even add our raw data points
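
The plots on these slides are not reproduced here, so the following is only a sketch of the same idea on a stand-in dataset (iris, which ships with R, is an assumption and not the data used in the lecture). It shows group means with standard error bars and the raw data points behind them.

```r
library(ggplot2)

# Stand-in data: iris is NOT the dataset from the slides
p <- ggplot(iris) +
  aes(x = Species, y = Sepal.Length) +
  # Raw data points, spread out horizontally so they do not overlap
  geom_jitter(width = 0.15, alpha = 0.4) +
  # Group means as larger points
  stat_summary(fun = mean, geom = "point", size = 3) +
  # Mean plus/minus one standard error as error bars
  stat_summary(fun.data = mean_se, geom = "errorbar", width = 0.2)

print(p)
```

mean_se() is the ggplot2 helper that returns the mean with one standard error on each side; a confidence interval would need a different summary function.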

What’s wrong with the following plot?

Source: http://www.politifact.com/truth-o-meter/statements/2015/oct/01/jason-chaffetz/chart-shown-planned-parenthood-hearing-misleading-/

Correct scale

Correct scale and add important information

What’s the problem with this viz?

Make scale proportional

Exploring the movies dataset
● Download the data from OSF (datasets folder)
● Access the data from the internet directly from R: load(url("https://stat.duke.edu/~mc301/data/movies.Rdata"))
● Question to explore together: How much do the critics and the audience agree on movie ratings?
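
A possible starting point for the question above, as a sketch only. plot_agreement() is a hypothetical helper written for this example, and the column names critics_score and audience_score are assumptions about the movies data; check names(movies) after loading. The download lines are commented out so the block also runs without an internet connection.

```r
library(ggplot2)

# Hypothetical helper: scatter plot of two score columns.
# The dashed y = x line marks perfect agreement between the two scores.
plot_agreement <- function(data,
                           x_col = "critics_score",    # assumed column name
                           y_col = "audience_score") { # assumed column name
  ggplot(data) +
    aes(x = .data[[x_col]], y = .data[[y_col]]) +
    geom_point(alpha = 0.5) +
    geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
    labs(x = x_col, y = y_col)
}

# With the lecture data (needs internet; load() should create `movies`):
# load(url("https://stat.duke.edu/~mc301/data/movies.Rdata"))
# plot_agreement(movies)
# cor(movies$critics_score, movies$audience_score, use = "complete.obs")
```

Points far from the dashed line are movies where critics and audience disagree; the correlation gives a single-number summary of the overall agreement.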

Exercise on your own (10 min)
Use this site if you need help: http://ggplot2.tidyverse.org/reference/
Answer as many questions as you can, using scatter plots:
● Option 1: Is there any yearly seasonality in movie genres? Are there movie genres that are not featured in certain months?
● Option 2: Do films get longer on average over time? Also highlight the films that are longer than 3 hours!
● Option 3: Is there an association between the number of votes and rating? Also check if title type is relevant in this question!

The layered “grammar of graphics”
ggplot(data = <DATA>) +
  aes(<MAPPINGS>) +
  <GEOM_FUNCTION>(
    mapping = aes(<MAPPINGS>),
    stat = <STAT>,
    position = <POSITION>
  ) +
  <FACET_FUNCTION> +
  <COORDINATE_FUNCTION> +
  <THEME>
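
To make the template concrete, here is one way to fill in every slot. This is a sketch on the mpg dataset that ships with ggplot2 (an assumption, not the lecture's data), and the particular choices for each slot are arbitrary.

```r
library(ggplot2)

p <- ggplot(data = mpg) +                   # <DATA>
  aes(x = displ, y = hwy) +                 # <MAPPINGS> shared by all layers
  geom_point(                               # <GEOM_FUNCTION>
    mapping = aes(color = class),           # layer-specific <MAPPINGS>
    stat = "identity",                      # <STAT>: use the values as they are
    position = "jitter"                     # <POSITION>: add small random noise
  ) +
  facet_wrap(~ drv) +                       # <FACET_FUNCTION>: one panel per drive type
  coord_cartesian(ylim = c(10, 45)) +       # <COORDINATE_FUNCTION>
  theme_minimal()                           # <THEME>

print(p)
```

Most slots have sensible defaults, so in practice only the data, the mappings and a geom are needed.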

Aesthetic mapping
What would you like to show on the plot? These can depend on data, or you can add constants.
● Position: x, y, z, lat, long, ymin, ymax, xmin, xmax, ...
● Sizing: size, width, height, ...
● Shape: shape
● Coloring: color, fill, alpha
You can specify any aesthetic, but it will only be added to the plot if it is relevant to the geoms.
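
The difference between an aesthetic that depends on data and a constant can be sketched like this (mpg is a stand-in dataset bundled with ggplot2; the plot names are made up for the example).

```r
library(ggplot2)

base <- ggplot(mpg) + aes(x = displ, y = hwy)

# Mapped aesthetic: color depends on a data column, so it goes inside aes()
p_mapped <- base + geom_point(aes(color = class))

# Constant aesthetic: the same color for every point, so it goes outside aes()
p_constant <- base + geom_point(color = "blue", alpha = 0.5)

print(p_mapped)
print(p_constant)
```

A common mistake is writing aes(color = "blue"), which treats "blue" as a data value and lets ggplot pick its own color for it.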

Geoms are different visualization objects
geom                Description
geom_point()        Points
geom_line()         Lines
geom_ribbon()       Ribbons, y range with continuous x values
geom_polygon()      Polygon, a filled path
geom_pointrange()   Vertical line with a point in the middle
geom_linerange()    An interval represented by a vertical line
geom_path()         Connect observations in original order
geom_histogram()    Histograms
geom_text()         Text annotations
geom_violin()       Violin plot (closely related to a beanplot)
geom_map()          Polygons on a map
Source: http://seananderson.ca/ggplot2-FISH554/

Understanding the boxplot

Statistical transformations
This happens automatically if you set the aesthetics and the geoms right.
● geom_bar()
● geom_smooth()
● geom_boxplot()
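
A short sketch of what "automatically" means for the three geoms above, again on the bundled mpg data (an assumption, not the lecture's data). layer_data() is used only to peek at the values ggplot2 computed behind the scenes.

```r
library(ggplot2)

# geom_bar() counts the rows in each category; no y aesthetic is needed
p_bar <- ggplot(mpg) + aes(x = class) + geom_bar()
bar_data <- layer_data(p_bar, 1)   # contains a computed `count` column

# geom_smooth() fits a model and draws the fitted line with a confidence band
p_smooth <- ggplot(mpg) + aes(x = displ, y = hwy) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# geom_boxplot() computes the median, quartiles and whiskers for each group
p_box <- ggplot(mpg) + aes(x = class, y = hwy) + geom_boxplot()
box_data <- layer_data(p_box, 1)   # contains lower, middle, upper, ...

print(p_bar)
print(p_smooth)
print(p_box)
```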

Position refers to data that is grouped
Can be:
● position = "dodge"
● position = "stack"
● position = "fill"
● position = "jitter"
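
The four position settings can be compared side by side with a sketch like the one below (mpg is a stand-in dataset; the plot names are made up for the example).

```r
library(ggplot2)

# Bars grouped by drive type within each car class
base <- ggplot(mpg) + aes(x = class, fill = drv)

p_stack <- base + geom_bar(position = "stack")  # groups on top of each other
p_dodge <- base + geom_bar(position = "dodge")  # groups next to each other
p_fill  <- base + geom_bar(position = "fill")   # stacked and rescaled to proportions

# Jitter adds small random noise, useful when many points overlap
p_jitter <- ggplot(mpg) + aes(x = displ, y = hwy) +
  geom_point(position = "jitter")

print(p_stack)
print(p_dodge)
print(p_fill)
print(p_jitter)
```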

Coordinate systems control the axes
● coord_flip()
● coord_polar()
● coord_map()
● coord_cartesian(ylim = c(1, 100), xlim = c(0, 50))
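
A small sketch of two of the coordinate functions above, using the bundled mpg data as a stand-in. The axis limits are arbitrary values chosen for this dataset.

```r
library(ggplot2)

p <- ggplot(mpg) + aes(x = class, y = hwy) + geom_boxplot()

# Zoom in on part of the y axis. coord_cartesian() only changes the view:
# all data are still used when the boxplot statistics are computed.
p_zoom <- p + coord_cartesian(ylim = c(10, 40))

# Swap the x and y axes, handy for long category labels
p_flip <- p + coord_flip()

print(p_zoom)
print(p_flip)
```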

Themes are preset designs
There are several themes other than the default
Try adding any of these to your plots, and see what happens
● theme_bw()
● theme_light()
● theme_minimal()
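
One quick way to try the themes above on the same plot, sketched with the bundled mpg data (the list name themed is made up for this example).

```r
library(ggplot2)

p <- ggplot(mpg) + aes(x = displ, y = hwy) + geom_point()

# The same plot with three different preset themes
themed <- list(
  bw      = p + theme_bw(),
  light   = p + theme_light(),
  minimal = p + theme_minimal()
)

# Draw them one after the other; use the arrows in the Plots pane to compare
for (plot in themed) print(plot)
```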

Homework
Read and try the r4ds data visualization chapter, and complete at least two tasks from each exercise group:
● 3.3.1
● 3.5.1
● 3.6.1
● 3.7.1
● 3.8.1
● 3.9.1