Tidy Data Creating better wellformatted spreadsheets Instructor Karalyn
Tidy Data: Creating better, well-formatted spreadsheets Instructor: Karalyn Ostler
Agenda > > > Using spreadsheets for data Formatting Spreadsheets Common Problems Using sorting to find errors Exporting Data
Setup > Does everyone have the training_attendance. xlsx spreadsheet?
About the workshops > MLIS Capstone Project > Based on the Library Carpentries Lessons
Uses for Spreadsheets in Libraries > > > Data entry Organizing data Subsetting and sorting data Statistics Plotting
What we will cover today > Good data entry practices - formatting data tables in spreadsheets > How to avoid common formatting mistakes > Sorting to find mistakes in spreadsheets > Exporting data from spreadsheets
What we will NOT cover today > How to do statistics in a spreadsheet > How to do plotting in a spreadsheet > How to write code in spreadsheet programs
Activity > How many people have used spreadsheets in their work? > What kind of operations do you do in spreadsheets? > Which ones do you think spreadsheets are good for? > What are some things that you’ve accidentally done in a spreadsheet, or have been frustrated that you can’t do easily?
Questions so far?
Structuring data in Spreadsheets
Cardinal rules for using spreadsheet programs for data: 1. 2. 3. 4. Leave the raw data raw - don’t mess with it! Put all your variables in columns Put each observation in its own row. Don’t combine multiple pieces of information in one cell. 5. Export the cleaned data to a text-based format like CSV.
Leave the raw data raw > Good: Create a new tab in Spreadsheet program > Better: Create a copy of a file – 20200504_RDMclass_registration_rawdata. csv – 20200504_RDMclass_registration_cleandata_1. csv
Keep track of your analyses > Create a new file or tab > Keep track of the steps you took in your clean up or analysis
Put all your variables in columns
Put all your variables in columns
Put each observation in its own row
Don’t combine multiple pieces of information in one cell
Activity: Breakout Rooms 1. 2. 3. 4. Open up the data in a spreadsheet program You can see that there are three tabs. Various people have recorded training attendance statistics over 2016 and 2017, and they have kept track of the data in their own way. Now you’re being asked to evaluate the training program and you want to be able to start doing statistics with the data. In your breakout rooms, identify potential problems in the 2016 and 2017 tabs. What changes need to be made so that a computer will be able to understand the data? Take note of these problems and changes. After you go through this exercise, we’ll discuss as a group what you think was wrong with this data and how you would fix it. Important: Do not forget of our first piece of advice: create a new file (or tab) for the cleaned data, and never modify the original (raw) data.
Questions?
Common Spreadsheet Errors (there is a lot of them…)
Multiple tables > Confuses computers!
Multiple tabs 1. Likely to accidentally add inconsistencies to your data 2. Add an extra step for yourself before you analyze the data because you will have to combine these data into a single datatable. 3. Confuses the computer!
From White et al, 2013, Nine simple ways to make it easier to (re)use your data. Ideas in Ecology and Evolution: Not filling in zeroes or using bad null values
Using formatting to convey information Before After
Using formatting to make the data sheet look pretty > Avoid: – Merging Cells – Special Characters – Stylistic coloring
More than one piece of information in a cell
Field name problems (column names) Good Name Good Alternative Avoid Max_temp_C Max. Temp Maximum Temp (°C) Precipitation_mm Precipitation precmm cell_type Cell. Type Cell Type Observation_01 first_observation 1 st Obs
Placing comments or units in cells > Rule of thumb: Avoid any extra information or formatting in cells > Put comments and notes on units in metadata or README file > Either a number value or a single descriptor
Inclusion of metadata in data table > Metadata should be in a separate file – README files > Possible Metadata – – Legends Column descriptions Comments Encoding comments
Questions?
Activity: Taking another look > Let’s look at our example spreadsheet again. > What problems do you see that we have not addressed? > What changes should be made to tidy the data? • • • Multiple tables Multiple tabs Not filling in zeros Using bad null values Using formatting to convey information Using formatting to make the data sheet look pretty • • • Placing comments or units in cells More than one piece of information in a cell Field name problems Special characters in data Inclusion of metadata in data table
Questions?
Basic Steps for Cleaning and Organizing Data
Sorting > Bad values often sort to bottom or top of the column. > If your dataset is well-structured and does not contain formulas, sorting should never affect the integrity of your dataset.
Activity: Sorting by column > Let’s try this with the Date tab in our messy spreadsheet. Go to that tab. Select Data then select Sort > Sort by date in the order Smallest to Largest or Oldest to Newest
Activity: Sorting by column > When you do this sort, do you notice anything strange? > Try sorting by the num_registered column. Anything strange there?
Questions?
Exporting data and popular data and file formats
Choosing a format for storing your data > Excel: . xls or. xlsx – Proprietary – Not interoperable – Varying software versions > Tab-delimited (. tsv) or Comma-Separated Value (. csv) – Open – Interoperable – Static
Saving an Excel file as a CSV file 1. From the top menu select ‘File’ and ‘Save as’. 2. In the ‘Format’ field, from the list, select ‘Comma Separated Values’ (*. csv). 3. Double check the file name and the location where you want to save it and hit ‘Save’.
Beware of Commas! CSV File looks like this: species_id, genus, species, taxa AB, Amphispiza, bilineata, Bird AH, Ammospermophilus, harrisi, Rodentnot, censused AS, Ammodramus, savannarum, Bird Excel looks like this:
Solution: data with commas > Put values in “quotes” – Works for small datasets – Potentially messy later > Use another character – Tab Separated Values (. tsv) – Literally any character
The END
Any Final Questions? > Open Refine Workshop – May 14 th 10: 30 am-12 pm – Clean Messy spreadsheets – Quick statistics > Other Library Carpentries Lessons: – https: //librarycarpentry. org/lessons/ > Full Tidy Data Lesson > Regular Expressions > Introduction to Git
Workshop Feedback Survey > Please fill out: https: //forms. gle/vmh. Wk. Hfmhstd 53 Yb 6 > All Workshop Materials can be found at: https: //osf. io/ct 82 p/
- Slides: 45