Readings: Levenstein & Lyle, data sharing; Wickham, tidy data

Readings:
  • Levenstein & Lyle, data sharing
  • Wickham, tidy data
  • Git and GitHub tutorial (recommended)
  • PSA data management policies (recommended)

Announcement: No drill this week

UNIT 5 Data management

ORGANIZING PROJECTS

Minimum codebook standards:
  1. Variable name in dataset
  2. Variable description (i.e., how it was derived)
  3. Permissible values of variable

Codebook package: https://psyarxiv.com/5qc6h/
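The three minimum fields can be kept in a small table with one row per variable. The following is a minimal sketch in base R, not the format produced by the codebook package; the dataset `d`, its variables, and the descriptions are hypothetical and exist only for illustration.

```r
# Hypothetical dataset used only for illustration
d <- data.frame(
  id    = 1:4,
  cond  = c("control", "treatment", "control", "treatment"),
  score = c(3, 5, 2, 4)
)

# One row per variable, covering the three minimum fields
codebook <- data.frame(
  variable    = c("id", "cond", "score"),
  description = c(
    "Participant identifier assigned at enrollment",
    "Experimental condition assigned at random",
    "Mean of five items rated on a 1-5 scale"
  ),
  permissible = c(
    "Positive integers",
    "control; treatment",
    "Numbers from 1 to 5"
  ),
  stringsAsFactors = FALSE
)

# Every variable in the dataset should be documented, with no stale entries
undocumented <- setdiff(names(d), codebook$variable)
stale        <- setdiff(codebook$variable, names(d))
stopifnot(length(undocumented) == 0, length(stale) == 0)

# Check one permissible-values rule against the data
score_ok <- all(d$score >= 1 & d$score <= 5)
print(codebook)
```

Storing the codebook as data rather than as free text makes it possible to check it against the dataset automatically, as the last few lines do.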

SECURING PROJECTS

Functions of security
  (1) Protect data from unauthorized access
  (2) Recover lost/corrupted files
  (3) Attain version control, or the ability to restore past versions of your writing/code/data

Crude version control
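Crude version control usually means saving dated copies of a file by hand. A minimal sketch of that idea in base R follows; `backup_file` is a hypothetical helper written for this example, not a function from any package, and the `backups` folder name is an assumption.

```r
# Save a timestamped copy of a file before editing it.
# `backup_file` is a hypothetical helper, not part of any package.
backup_file <- function(path, backup_dir = file.path(dirname(path), "backups")) {
  stopifnot(file.exists(path))
  dir.create(backup_dir, showWarnings = FALSE, recursive = TRUE)

  # Build a name such as "data_20240131-101500.csv"
  stamp <- format(Sys.time(), "%Y%m%d-%H%M%S")
  ext   <- tools::file_ext(path)
  base  <- tools::file_path_sans_ext(basename(path))
  name  <- if (nzchar(ext)) paste0(base, "_", stamp, ".", ext) else paste0(base, "_", stamp)
  dest  <- file.path(backup_dir, name)

  # Never overwrite an existing backup
  ok <- file.copy(path, dest, overwrite = FALSE)
  if (!ok) stop("Backup failed: ", dest)
  dest
}

# Usage with a temporary file
tmp <- tempfile(fileext = ".csv")
writeLines("id,score\n1,3", tmp)
saved <- backup_file(tmp)
```

This covers only the ability to restore a past version. It does not protect against unauthorized access or against loss of the disk that holds the copies.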

Some more comprehensive solutions
  • OSF
  • Dropbox
  • GitHub

USEFULLY STRUCTURING DATA

Thinking about dataset structure is most helpful with your processed datasets (but sometimes you’ll have some choice about structure for source & raw data too).

Datasets are useful when they are …
  (1) Easy to graph
  (2) Easy to explore
  (3) Easy to clean
  (4) Easy to maintain if errors/corrections are needed

TIDY DATA

Data consist of values measured on an observational unit. Each value belongs to:
  (1) a variable, or a specific type of measurement
  (2) a case (Wickham: observation), which contains the values of all variables for an observational unit

Data are tidy when:
  (1) Each variable has its own column
  (2) Each case has its own row
  (3) Each observational unit has its own table

Violation 1: Values used as column labels
Pew data: Frequencies of occurrence of different income brackets by religion
Problem: The income brackets used as column labels are values of a variable.
Fix: gather(d, key = "income", value = "freq", -religion)
(more not shown …)
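The `gather()` call can be tried on a small table. This sketch assumes the tidyr package is installed and uses made-up counts in the shape of the Pew table, not the real data; the religion and income labels are placeholders. It also shows `pivot_longer()`, which tidyr now recommends in place of `gather()`.

```r
library(tidyr)

# Made-up counts in the shape of the Pew table (not the real data)
d <- data.frame(
  religion    = c("A", "B"),
  `<$10k`     = c(10, 20),
  `$10-20k`   = c(30, 40),
  check.names = FALSE
)

# The gather() call from the slide: column labels become values of `income`
long_old <- gather(d, key = "income", value = "freq", -religion)

# The same reshaping with pivot_longer()
long_new <- pivot_longer(
  d,
  cols      = -religion,
  names_to  = "income",
  values_to = "freq"
)

print(long_old)
```

Both results have one row per combination of religion and income bracket. They differ only in row order and in that `pivot_longer()` returns a tibble.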

Violation 2: Multiple variables in one column
Tuberculosis data: Male and female cases across different age groups
Problem: Each column label combines two variables, sex (male vs. female) and age bracket.
Fix:
  (1) gather(d, key = "column", value = "cases", -c(country, year))
  (2) mutate(d, sex = ..., age = ...)
(more not shown …)
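The two steps can be sketched on a small table. This assumes the tidyr and dplyr packages are installed and uses made-up counts, not the real tuberculosis data. The slide elides the expressions inside `mutate()`, so the `substr()` calls below are one possible choice, assuming labels such as `m014` that start with a one-letter sex code.

```r
library(tidyr)
library(dplyr)

# Made-up counts in the shape of the tuberculosis table (not the real data);
# a label such as "m014" combines sex ("m") with an age bracket ("014")
d <- data.frame(
  country = c("AA", "BB"),
  year    = c(2000, 2000),
  m014    = c(1, 2),
  f014    = c(3, 4),
  m1524   = c(5, 6),
  f1524   = c(7, 8)
)

# Step 1 from the slide: stack the combined columns into key/value pairs
long <- gather(d, key = "column", value = "cases", -c(country, year))

# Step 2: split the combined label into two variables.
# Assumption: the first character is sex and the rest is the age bracket.
tidy <- mutate(
  long,
  sex = substr(column, 1, 1),
  age = substr(column, 2, nchar(column))
)
tidy <- select(tidy, country, year, sex, age, cases)

print(tidy)
```

tidyr's `separate()` is another way to do the second step when the split position is fixed.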

Violation 3: Variables stored in rows
Weather data: Maximum and minimum temperatures at different dates and weather stations
Problem: Two separate variables, (1) minimum temperature and (2) maximum temperature, are stored in rows.
Fix: spread(d, key = "element", value = "value")
(more not shown …)
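The `spread()` call can be tried on a small table. This sketch assumes the tidyr package is installed and uses made-up readings, not the real weather data; the station id, dates, and the `tmax`/`tmin` labels are placeholders. It also shows `pivot_wider()`, which tidyr now recommends in place of `spread()`.

```r
library(tidyr)

# Made-up readings in the shape of the weather table (not the real data)
d <- data.frame(
  id      = rep("ST1", 4),
  date    = rep(c("2010-01-01", "2010-01-02"), each = 2),
  element = rep(c("tmax", "tmin"), times = 2),
  value   = c(27.8, 14.5, 27.3, 14.4)
)

# The spread() call from the slide: each element becomes its own column
wide_old <- spread(d, key = "element", value = "value")

# The same reshaping with pivot_wider()
wide_new <- pivot_wider(d, names_from = element, values_from = value)

print(wide_old)
```

After reshaping, each row is one station on one date, with minimum and maximum temperature as separate columns.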

Violation 4: Two units in one table
Billboard data: Weekly rankings of hit songs
Problem: The table mixes ranking variables with track variables.
Fix: Split the table into one table per unit.
  unique(d[, c("id", track_vars)])
  unique(d[, c("id", rank_vars)])
(more not shown …)
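The split can be sketched in base R on a small table. The rows below are made up in the shape of the Billboard table, not the real data, and the contents of `track_vars` and `rank_vars` are assumptions, since the slide does not list which columns belong to each group.

```r
# Made-up rows in the shape of the Billboard table (not the real data)
d <- data.frame(
  id     = c(1, 1, 2, 2),
  artist = c("Artist A", "Artist A", "Artist B", "Artist B"),
  track  = c("Song A", "Song A", "Song B", "Song B"),
  week   = c(1, 2, 1, 2),
  rank   = c(87, 82, 91, 87)
)

# Hypothetical column groupings; the slide does not list them
track_vars <- c("artist", "track")
rank_vars  <- c("week", "rank")

# One table per observational unit, linked by `id`
tracks <- unique(d[, c("id", track_vars)])
ranks  <- unique(d[, c("id", rank_vars)])

# Joining on `id` recovers the combined table when it is needed
rejoined <- merge(tracks, ranks, by = "id")

print(tracks)
print(ranks)
```

Each track's artist and title are now stored once, so a correction to a title has to be made in only one row.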

Violation 5: One unit in multiple tables
Fix:
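One common approach to this violation, following Wickham's tidy data paper, is to add a column recording which table each row came from and then stack the tables into one. A minimal sketch in base R with made-up tables; the table names, columns, and the choice of `year` as the origin variable are assumptions for illustration.

```r
# Made-up tables: the same observational unit split across years
tables <- list(
  "2019" = data.frame(name = c("Ann", "Bo"), n = c(5, 3)),
  "2020" = data.frame(name = c("Ann", "Cy"), n = c(4, 6))
)

# Record each table's origin in a new column
labelled <- Map(function(tbl, label) {
  tbl$year <- as.integer(label)
  tbl
}, tables, names(tables))

# Stack the labelled tables into a single table
combined <- do.call(rbind, labelled)
rownames(combined) <- NULL

print(combined)
```

In practice the list would usually come from reading several files, with the file name supplying the origin label.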

Summary

You should make codebooks for your datasets. At a minimum, these should contain:
  1. Variable name in dataset
  2. Variable description (i.e., how it was derived)
  3. Permissible values of variable

Secure your projects using a method that allows for version control.

One useful structure for datasets is the tidy structure. A dataset is tidy when …
  1. Each variable has its own column
  2. Each case has its own row
  3. Each observational unit has its own table