CSV Comma Separated Values
- Slides: 10
Goals for these videos
• Understand the distinction between a schema and a database instance
• Understand three commonly used file formats
Comma Separated Values
• Delimited flat file
• Stores tabular data (numbers and text) in plain text
• Each line is a record
• Each record is a list of fields, separated by commas
• No actual standard except convention
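A minimal sketch of the record/field structure described on this slide, using Python's standard csv module. The sample rows are made up for illustration.

```python
import csv
import io

# Hypothetical sample data: three records, three fields each.
text = "Josh,Nahum,48823\nTyler,Smith,48824\nAda,Lovelace,10001\n"

# Each line is a record; csv.reader splits each record into a list of fields.
records = list(csv.reader(io.StringIO(text)))

for record in records:
    print(record)
```

Note that every field comes back as a string. CSV itself carries no type information, so "48823" is text until the reader decides otherwise.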
CSV Edge Cases
• Fields can be put in double quotes: "josh","2016"
• Fields containing an embedded comma (,), double quote ("), or newline character must be in double quotes: "Nahum, Josh"
• Embedded double quotes must be preceded by an additional double quote: "Josh said, ""Hi"" to us!"
• The first line of the file may be a header, which contains the column names. You need contextual information to tell if this is the case.
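The quoting rules above can be sketched as a round trip through Python's csv module, which applies them automatically when writing. The newline example is an added illustration; the other two fields come from the slide.

```python
import csv
import io

rows = [
    ["Nahum, Josh", 'Josh said, "Hi" to us!'],  # embedded comma, embedded quotes
    ["line one\nline two", "plain"],            # embedded newline, ordinary field
]

# Write the rows; the writer quotes only the fields that need it
# and doubles any embedded double quotes.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(rows)
encoded = buffer.getvalue()
print(encoded)

# Reading the text back recovers the original field values.
decoded = list(csv.reader(io.StringIO(encoded)))
print(decoded == rows)
```

Letting a library do the quoting is usually safer than joining fields with commas by hand, since a single embedded comma or quote would otherwise corrupt the record.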
CSV Example

Table Contents
To              Subject          Message
josh@msu.edu    Sign Up          Do it, Do it now
tyler@msu.edu   "Scare" Quotes   Are they allowed?

CSV Contents
To,Subject,Message
josh@msu.edu,Sign Up,"Do it, Do it now"
tyler@msu.edu,"""Scare"" Quotes","Are they allowed?"
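As a quick check of this example, the sketch below parses the CSV text with Python's csv.DictReader. The CSV string is a reconstruction of the slide's example, and treating the first line as a header is an assumption we can make only because we know the table's column names.

```python
import csv
import io

csv_text = (
    "To,Subject,Message\n"
    'josh@msu.edu,Sign Up,"Do it, Do it now"\n'
    'tyler@msu.edu,"""Scare"" Quotes","Are they allowed?"\n'
)

# DictReader uses the first line as column names for every later record.
rows = list(csv.DictReader(io.StringIO(csv_text)))

for row in rows:
    print(row["To"], "|", row["Subject"], "|", row["Message"])
```

The parsed values match the table: the comma inside the first message survives, and the doubled quotes collapse back to a single pair around Scare.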
Well-Formed CSV
Which of these lines are well-formed (legal) lines in a CSV file?
Josh,Nahum,48823
Hi Class!,Friday,2016
""Stop" he said",Josh
New York City,40° 42'46"N,74° 00'21"W
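One way to explore this question is a small checker that applies the quoting rules strictly: a field is either fully enclosed in double quotes (with embedded quotes doubled) or contains no double quotes at all. This is a sketch under that assumption, not a definition of CSV; the function name is hypothetical, and lenient parsers may accept lines that it rejects.

```python
def is_well_formed(line: str) -> bool:
    """Strict single-line check of the CSV quoting rules."""
    i, n = 0, len(line)
    while True:
        if i < n and line[i] == '"':
            # Quoted field: scan to the closing quote, skipping doubled quotes.
            i += 1
            while True:
                if i >= n:
                    return False          # quote never closed
                if line[i] == '"':
                    if i + 1 < n and line[i + 1] == '"':
                        i += 2            # escaped quote inside the field
                        continue
                    i += 1                # closing quote
                    break
                i += 1
            if i == n:
                return True
            if line[i] != ",":
                return False              # text after the closing quote
            i += 1
        else:
            # Unquoted field: no double quotes allowed anywhere in it.
            while i < n and line[i] != ",":
                if line[i] == '"':
                    return False
                i += 1
            if i == n:
                return True
            i += 1


quiz_lines = [
    "Josh,Nahum,48823",
    "Hi Class!,Friday,2016",
    '""Stop" he said",Josh',
    'New York City,40° 42\'46"N,74° 00\'21"W',
]
for line in quiz_lines:
    print(is_well_formed(line), line)
```

Under this strict reading the first two lines pass and the last two fail, because both contain double quotes that are neither doubled nor used to enclose a whole field.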
CSV Schema 1.0
CSV Schema defines a textual language that can be used to define the data structure, types, and rules for a data format. For instance, we may want to restrict which values are legal in a given column. The CSV format itself is very permissive, so we need a second document to define what constitutes "valid" data. There is a working draft of a CSV Schema (http://digitalpreservation.github.io/csvschema/) by the National Archives of the UK.
Example

CSV Schema
version 1.0
@totalColumns 3
name: notEmpty
age: range(0, 120)
gender: is("m") or is("f")

Valid CSV Data
name,age,gender
james,21,m
lauren,19,f
simon,57,m
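To make the three rules concrete, here is a hand-written Python check that mirrors them. It is not an implementation of the CSV Schema language or of any validator tool; the function names are hypothetical, and treating age as a whole number is a simplifying assumption.

```python
import csv
import io


def in_range(text: str, low: int, high: int) -> bool:
    """True if text is an integer between low and high (inclusive)."""
    try:
        return low <= int(text) <= high
    except ValueError:
        return False


def validate(csv_text: str) -> list[str]:
    """Return error messages; an empty list means the data is valid."""
    errors = []
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    if header != ["name", "age", "gender"]:
        errors.append(f"unexpected header: {header}")
    for line_no, row in enumerate(reader, start=2):
        if len(row) != 3:                      # @totalColumns 3
            errors.append(f"line {line_no}: expected 3 columns, got {len(row)}")
            continue
        name, age, gender = row
        if name == "":                         # name: notEmpty
            errors.append(f"line {line_no}: name is empty")
        if not in_range(age, 0, 120):          # age: range(0, 120)
            errors.append(f"line {line_no}: age {age!r} not in 0..120")
        if gender not in ("m", "f"):           # gender: is("m") or is("f")
            errors.append(f"line {line_no}: gender {gender!r} not allowed")
    return errors


valid_data = "name,age,gender\njames,21,m\nlauren,19,f\nsimon,57,m\n"
invalid_data = "name,age,gender\nsimon,157,x\n"

print(validate(valid_data))
print(validate(invalid_data))
```

The second input is still perfectly good CSV; it fails only because its values break the rules that the schema adds on top of the format.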
Well-Formed versus Valid
• Well-formed means the data conforms to the file format (e.g. CSV).
• Valid means the data conforms to a schema (more restrictive than the format).
Whitespace
Do these two lines represent the same record/content?
Josh, Nahum, 48823
• Yes
• No
• Depends
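A short experiment suggests why the answer can depend on who is reading the file. Python's csv.reader keeps the space after each comma by default, but drops it when skipinitialspace=True is set. The unspaced variant used for comparison is an assumption.

```python
import csv
import io

spaced = "Josh, Nahum, 48823"
compact = "Josh,Nahum,48823"   # assumed comparison line without spaces


def parse(line: str, **options) -> list[str]:
    """Parse a single CSV line with the given csv.reader options."""
    return next(csv.reader(io.StringIO(line), **options))


strict = parse(spaced)                           # spaces kept as part of the fields
lenient = parse(spaced, skipinitialspace=True)   # spaces after commas ignored
baseline = parse(compact)

print(strict)
print(lenient)
print(strict == baseline, lenient == baseline)
```

With default settings the two lines produce different fields, while the lenient setting makes them equal, so whether they are the same record is a decision made by the reader rather than by the file.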