OPEN DIALOGUE ON DIGITAL DATA MANAGEMENT
Pat Burns, Dean
Dawn Paschal, Assistant Dean
CSU Libraries
October 13, 2010

Background
• NSF requires proposals submitted as of Jan. 18, 2011 to include plans for data management: http://www.nsf.gov/pubs/policydocs/pappguide/nsf11001/gpg_2.jsp#IIC2j
• NIH & USDA also have similar requirements
• Other agencies looming: ‘Federal Research Public Access Act’
• Maximizing the value of data by sharing
  • Discoverability
  • Access
  • Preservation
  • Management

Science ‘Then’ (5-10 years ago)
[Diagram: Theory, Experiment, Computation]

Science ‘Now’
[Diagram: Theory, Experiment, Computation, with Data labels]

Science ‘Emerging’
[Diagram: Experiment, Theory, Computation, Data]

Science 2.0 ‘Now’?
[Diagram: Theory, Computation, Experiment, with Data labels]

Large Digital Data Sets
• Satellite imagery can generate > 1 petabyte (10^15 bytes) of data per day!
• Supercomputers also generate massive data sets
• Can we transport them? E.g., at 10 Gbits per second (note bits, not bytes: 1 byte = 8 bits)
  • Time = 8 x 10^15 bits/(10^10 bits/sec) = 8 x 10^5 secs = 222 hours = 1 week, 2 days, 6 hours, 13 mins
• Can we store them?
  • Requires 500 ea. 2 TByte disks @ $250 ea. = $125,000; @ 5 year lifetime = $25,000/yr.
  • Requires 1 full rack in a data center: space, power, cooling, …

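The transport and storage arithmetic on this slide can be checked with a short sketch. It uses only the unit figures quoted on the slide (1 petabyte, 10 Gbits per second, 2 TByte disks at $250 each, 5 year lifetime); the variable and function names are illustrative, and the transfer time is an idealized figure that ignores protocol overhead. Recomputing from the unit figures gives 500 disks, $125,000 to purchase, and $25,000 per year.

```python
# Unit figures quoted on the slide; names are illustrative, not from the source.
DATA_BYTES = 10**15            # 1 petabyte
LINK_BITS_PER_SEC = 10**10     # 10 Gbits per second
DISK_TBYTES = 2                # capacity of one disk
DISK_COST_USD = 250            # price of one disk
DISK_LIFETIME_YEARS = 5


def transfer_seconds(n_bytes: int, bits_per_sec: int) -> float:
    """Idealized transfer time: 8 bits per byte, no protocol overhead."""
    return n_bytes * 8 / bits_per_sec


def disks_needed(n_bytes: int, disk_tbytes: int) -> int:
    """Disks required, rounded up, taking 1 TByte = 10**12 bytes."""
    disk_bytes = disk_tbytes * 10**12
    return -(-n_bytes // disk_bytes)  # ceiling division


seconds = transfer_seconds(DATA_BYTES, LINK_BITS_PER_SEC)
hours = seconds / 3600
n_disks = disks_needed(DATA_BYTES, DISK_TBYTES)
purchase_cost = n_disks * DISK_COST_USD
annual_cost = purchase_cost / DISK_LIFETIME_YEARS

print(f"Transfer: {seconds:.0f} s, about {hours:.0f} hours")
print(f"Storage: {n_disks} disks, ${purchase_cost:,} total, ${annual_cost:,.0f}/yr")
```

Keeping the inputs as named constants makes it easy to rerun the estimate for a different link speed or disk price.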
Incoming!
• An individual researcher can generate many data sets
• We have many researchers who generate large data sets
• Number: Many x many = Very many!
• Size: Very many x Very big = Enormous!

Projected Needs (2009 CSU Survey)
Research Data Storage Needed (TBytes)
  Now      219
  2 Years  1,384
  5 Years  4,914
CSU-DR = 3 TBytes!!!

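To put the survey figures in perspective, the sketch below computes the compound annual growth rate they imply and the ratio between the current need and the 3 TBytes quoted for the CSU-DR. The growth-rate formula is a standard one applied here as an illustration; it is not part of the survey, and the names are hypothetical.

```python
# Survey figures from the slide, in TBytes. Names are illustrative.
needs_tb = {"now": 219, "2 years": 1384, "5 years": 4914}
csu_dr_tb = 3


def annual_growth(start: float, end: float, years: float) -> float:
    """Compound annual growth rate implied by two data points."""
    return (end / start) ** (1 / years) - 1


growth_2yr = annual_growth(needs_tb["now"], needs_tb["2 years"], 2)
growth_5yr = annual_growth(needs_tb["now"], needs_tb["5 years"], 5)
shortfall_ratio = needs_tb["now"] / csu_dr_tb

print(f"Implied growth over 2 years: {growth_2yr:.0%} per year")
print(f"Implied growth over 5 years: {growth_5yr:.0%} per year")
print(f"Current need is {shortfall_ratio:.0f}x the CSU-DR figure")
```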
How Can We Help?
Libraries
• Data organization & structure
• IP issues
• Metadata
• Discoverability
• Preservation
Data/Info Stewards
• The ‘front end’
• Interactions w/ researchers
Joint Operations
IT
• Storage capacity
• Transport capacity
• Back-up
• Sysadmin
• IT security/privacy
• Transcoding
System Stewards
• The ‘back end’

How Can We Help (cont’d)?
• Agreement upon a framework
  • Draft of a framework, present to faculty
• Language for our faculty to include in their proposals
• Strategy, policy, procedures
  • Definition of work flow(s)
• Architectures for operations & preservation
  • Back-up vs. preservation, LOCKSS?

Policies: The ‘Front End’
• DRM: IP/ownership issues: data sets not ‘copyrightable’ (not creative works)
  • But there may be local, institutional IP policies that override this
  • Note that IP ≠ copyright
  • Creative Commons or Science Commons licensing may apply
  • An embargo period is required
• What are the preservation periods?

NSB Data Type Definitions*
• Research collections (small, useful to individuals/teams for life of a project, limited curation, standards typically lacking)
• Resource collections (medium, useful to a community, follow group’s standards, mid- to long-term utility)
• Reference collections (large, serve many segments of science/engineering, conform to robust standards, indefinite support)
*National Science Board

Workflow
• Faculty provide data/information
  • Enter metadata, user’s manuals, select embargo period, select licensing options, enter pubs, point to or supply data sets, …
• Librarians manage data/information
  • Review metadata, ingest and make accessible, review periodically, deaccession periodically (annually?), manage data, interact w/ faculty
• IT staff implement and operate systems
  • Operate system, backups, security, upgrading storage, transport, move to LOCKSS, etc.

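One way to picture the hand-off between faculty and librarians is as a single submission record that faculty fill in and librarians review. The sketch below is purely illustrative: the class, its field names, the rule that data becomes accessible only after review and after the embargo ends, and the example URL are all assumptions, not a description of any CSU system.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class Submission:
    """Hypothetical record of what faculty provide in the workflow."""
    title: str
    submitted: date
    embargo_days: int                 # embargo period selected by faculty
    license_name: str                 # licensing option selected by faculty
    data_location: str                # pointer to, or location of, the data set
    publications: list[str] = field(default_factory=list)
    manuals: list[str] = field(default_factory=list)
    reviewed: bool = False            # set by a librarian after metadata review

    def embargo_ends(self) -> date:
        """Date on which the selected embargo period expires."""
        return self.submitted + timedelta(days=self.embargo_days)

    def is_accessible(self, today: date) -> bool:
        """Assumed rule: accessible once reviewed and past the embargo."""
        return self.reviewed and today >= self.embargo_ends()


record = Submission(
    title="Example data set",
    submitted=date(2010, 10, 13),
    embargo_days=365,
    license_name="Creative Commons",
    data_location="https://example.org/datasets/123",  # hypothetical pointer
)
before_review = record.is_accessible(date(2012, 1, 1))
record.reviewed = True
after_review = record.is_accessible(date(2012, 1, 1))
```

Here `before_review` is false because no librarian has reviewed the metadata yet, while `after_review` is true because the review is done and the one-year embargo has passed.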
Digital Assets - the 4 Pieces
• The metadata, ideally on the CSU-DR
  1. Typical, what we collect today, e.g., lightweight metadata (probably not copyrightable)
  2. Contextual, e.g., user’s manuals (yes, copyrightable)
3. Scholarly publications associated with the data, ideally on the CSU-DR
4. The data itself, which should be in the most appropriate place (pointers?)

Digital Assets Management
[Diagram: 1. Metadata, 2. User’s Manuals, 3. Pubs, 4. Data Sets (Small, Medium, Large); Libraries-DR; Local Storage; “Pointers”; Disciplinary Repositories, SC Centers, etc.; “The Cloud”]

Architecture
[Diagram: LOCKSS; High-speed Networks; Primary System; The Digital Repository; Preservation System]

CSU Storage Project
45 TBytes (raw) for ~$8k

Strategy for Storage of Data Sets
• Small, < 100 GB: we would agree to store on the DR, but not forever
• Medium: we would agree to store on the DR for a limited time at a cost, or on a local server somewhere and we point to it
• Large: stored on a disciplinary DR somewhere, at a supercomputer center, or at a large instrument center
  • We point to it (persistent URL?)
• How do we deal with exceptions?

What CSUL Will Store & at What Cost
Size classes: Small (< 0.1 TB), Medium (0.1-10 TB), Large (> 10 TB)
Preservation period (+ means beyond end of grant period), with the entries listed on the slide for each row:
• Short (1 yr. +): Free; Maybe +
• Medium (2 yrs. +): Free; $500/TB; Maybe -
• Long (> 5 yrs. +): $1,000/TB; No
Forever is a long time…

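The size classes on this slide and the storage strategy from the preceding slide can be combined into a small lookup. This is a minimal sketch: the function and dictionary names are hypothetical, and treating a data set of exactly 0.1 TB or exactly 10 TB as medium is an assumption, since the slides do not say which class the boundary values fall into. Costs are deliberately left out.

```python
# Thresholds from the slide, in TBytes. Boundary handling is an assumption.
SMALL_MAX_TB = 0.1
MEDIUM_MAX_TB = 10.0

# Placement wording paraphrased from the storage strategy slide.
PLACEMENT = {
    "small": "store on the DR, but not forever",
    "medium": "store on the DR for a limited time at a cost, "
              "or on a local server with a pointer",
    "large": "disciplinary repository, supercomputer center, or large "
             "instrument center, with a pointer",
}


def size_class(size_tb: float) -> str:
    """Classify a data set as small, medium, or large by its size in TBytes."""
    if size_tb < 0:
        raise ValueError("size must be non-negative")
    if size_tb < SMALL_MAX_TB:
        return "small"
    if size_tb <= MEDIUM_MAX_TB:
        return "medium"
    return "large"


def placement(size_tb: float) -> str:
    """Suggested storage location for a data set of the given size."""
    return PLACEMENT[size_class(size_tb)]


for example_tb in (0.05, 2.0, 45.0):
    print(f"{example_tb} TB -> {size_class(example_tb)}: {placement(example_tb)}")
```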
Needs
• Libraries-IT partnership
• Define policies for usage
• Define practice for usage
  • Definition of workflows
  • Operations
• Develop needed tools
  • Build an on-line, self-service submission tool + requirements for review of user-created metadata
• Establish systems
  • Develop preservation infrastructure

Issues
• Will the DR become a ‘Trusted Digital Repository’?
  • Will this enhance our proposals?
• What will be stored where?
  • Will disciplinary digital repositories emerge, e.g., at NCAR and elsewhere?
  • Flexibility is key
• How best to engage
  • The VPR (probably already accomplished)
  • The faculty
  • Library staff: faculty and operational (DM Librarians at UNM?)

Discussion
• Is most welcome.