MANAGING RESEARCH DATA AT MIT: GROWING THE CURATION COMMUNITY ONE INSTITUTION AT A TIME

MacKenzie Smith
Associate Director for Technology, MIT Libraries
Science Commons Research Fellow, Creative Commons
December 2010, 6th International Digital Curation Conference ©MIT

Chapter 1 Data Curation as a Metadiscipline

Media Ecology
1960s: McLuhan defines Media Ecology to describe “how our interaction with media facilitates or impedes our chances of survival. The word ecology implies the study of environments: their structure, content, and impact on people.”
Neil Postman, 1971

The Nature of Information (as applied to research data)
Information {Data} Has Five Properties
Changes in:
• Form: new understanding
• Magnitude: information overload
• Velocity: instant feedback
• Direction
• Access: new relationships
Nystrom, C. (1973). “Towards a Science of Media Ecology: The Formulation of Integrated Conceptual Paradigms for the Study of Human Communication Systems,” unpublished doctoral dissertation.

The Rise of Interdisciplinarity
“One of the consequences of the change [to our modern conception of knowledge] is a movement away from the rigidly compartmentalized, uncoordinated specialization in scientific inquiry which characterized the Newtonian world, and a movement toward increasing integration of both the physical and the social sciences.”
Nystrom, C. (1973) ibid.

The Rise of Interdisciplinarity
“One of the symptoms of this trend is the proliferation, in recent years, of ‘compound’ disciplines such as mathematical biochemistry, psychobiology, linguistic anthropology, psycholinguistics, and so on.”

The Rise of Interdisciplinarity
Media Ecology as a “Metadiscipline”
• Is Data Curation driven by interdisciplinarity?
• Data Ecology as a new metadiscipline?

Chapter 2 Goals and Challenges of Data Curation

What are the Goals of Data Curation?
To meet our obligation to our research community, our funders, the public
• Reproducing results
• Reusing data in new contexts (e.g. new tool)
• Aggregating data for new research
    • Compiled/derived databases
    • Computational sciences
    • Interdisciplinary research

What are the Functions of Data Curation?
• Finding it
• Making sense of it
• Using it
• Aggregating it
• Publishing it
• Referencing it
• Preserving it

Finding it: Centralized vs Distributed Collections
Image: alexdecarvalho, http://www.flickr.com/photos/adc/

Making sense of it: Provenance
• Methodology
• Semantics
• Authorship
• Peer review
• Changes over time
Gap between ability and motivation to provide this

Working with it: Tools
• Analyze
• Visualize
• Reproduce results
Very little being done to catalog, archive, and provide access to data tools

Aggregating it: Encoding Standards
• Domain conventions vary
    • FITS in astronomy; CIF in crystallography; EML (XML) in ecology; GO (RDF) in genetics
    • Difficult to integrate arbitrary data
• Web standards (e.g. RDF) as a syntax substrate?
• Who maps from domain to generic encoding?
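
A rough sketch of what mapping a domain record onto a generic encoding could look like, written in Python with only the standard library. It turns one flat key/value record into N-Triples lines. The base URIs, the record identifier, and the sample header values are all hypothetical and chosen only for illustration; a real mapping would use the domain's own vocabulary and most likely an RDF library rather than hand-built strings.

```python
from urllib.parse import quote

# Hypothetical base URIs, chosen only for this illustration.
DATASET_BASE = "http://example.org/dataset/"
VOCAB_BASE = "http://example.org/vocab/"


def escape_literal(value: str) -> str:
    """Escape backslashes, double quotes and newlines for an N-Triples literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )


def record_to_ntriples(record_id: str, fields: dict) -> list:
    """Turn one flat key/value record into a list of N-Triples lines."""
    subject = f"<{DATASET_BASE}{quote(record_id, safe='')}>"
    lines = []
    # Sort the keys so the output order is stable across runs.
    for key, value in sorted(fields.items()):
        predicate = f"<{VOCAB_BASE}{quote(key, safe='')}>"
        literal = f'"{escape_literal(str(value))}"'
        lines.append(f"{subject} {predicate} {literal} .")
    return lines


# A made-up header, loosely resembling keywords a FITS file might carry.
header = {"TELESCOP": "SDSS 2.5-M", "FILTER": "r"}
triples = record_to_ntriples("frame-001", header)
print("\n".join(triples))
```

The sketch only settles syntax. The slide's open question, who decides what each domain keyword means in the generic vocabulary, is untouched by it.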

Aggregating it: Social and Policy Norms
• Intellectual Property Rights
    • Public Domain or Open Access vs.
    • Embargoed or Access Controlled
    • “right to preserve” principle from Blue Ribbon Task Force
• Citation and credit system for data
Who sets policy and socializes norms?

Aggregating it: Legal Interoperability
IPR and data licenses
• Much data not copyrightable since facts cannot be copyrighted (in the U.S.)
• UK, EU, Australia, other countries have sui generis data rights
• Laws not “interoperable” without explicit direction
• Big problem for international scientific collaborations and data re-purposing

Publishing it: Peer review
• Data Papers
    • citable entity not linked to a “dataset” (i.e. file)
• Enhanced publications containing or linking to data
• Nanopublications
These are just beginning to emerge

Referencing it: Credit
• Data Papers in WoS, Scopus, PubMed, etc.
• DOIs for Data (DataCite, CrossRef)
• The conundrum of handling LOD
    • Attribution via URI, e.g. ORCID?
    • How to handle attribution stacking?
Requires credit system (e.g. awards, tenure) to notice

Preserving it: Curation
• Bits (forensics)
• Semantics (interoperable)
• Pragmatic (immediately useable)
Technically hard, potentially expensive. Whose mission? Who pays?
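
The "bits" level is commonly checked with fixity information: a checksum recorded when a file is ingested and recomputed later to detect silent change. The Python sketch below illustrates that idea under stated assumptions: SHA-256 as the digest, and a throwaway sample file with made-up contents. It is not a description of any particular archive's workflow.

```python
import hashlib
import os
import tempfile


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_fixity(path: str, expected: str) -> bool:
    """True if the file's current checksum matches the recorded one."""
    return sha256_of(path) == expected


# Demonstration on a temporary file (hypothetical contents).
with tempfile.TemporaryDirectory() as workdir:
    sample = os.path.join(workdir, "sample.dat")
    with open(sample, "wb") as handle:
        handle.write(b"observation data")
    recorded = sha256_of(sample)  # checksum stored at ingest time
    intact = verify_fixity(sample, recorded)

    # Simulate corruption by changing a single byte.
    with open(sample, "wb") as handle:
        handle.write(b"observation dat4")
    damaged = not verify_fixity(sample, recorded)

print(intact, damaged)
```

A check like this covers only the first bullet. Semantic and pragmatic preservation need metadata and tools, which is where the cost questions on the slide come in.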

Chapter 3 The Data Curation Ecology

Curation Ecology: technology view
1. Storage layer: iRODS, S3, Palimpsest
2. Data management layer: IRs, ICPSR, UK Data Archive
3. Linking (or Semantic) layer: SFX, Semantic Web
4. Discovery layer: Google/Google Scholar, ICPSR UI
5. Delivery layer: content interaction tools, e.g. Ajax widgets like MIT Exhibit
6. Social layer: myGrid/Taverna, Kepler, VREs, VIVO

Curation Ecology: functional view
1. Storage layer: bit-level persistence
2. Data management layer: metadata, policies, preservation strategies
3. Linking (or Semantic) layer: identifiers, RDF, ORE encoding
4. Discovery layer: library catalogs, Web search engines, federated search
5. Delivery layer: ebook readers, visualization tools, streaming media servers, security and ethics
6. Social layer: collaboration tools, social networking tools, VLEs and VREs
7. Business layer: cost recovery, legal/policy frameworks, virtual organizations

Curation Ecology: organizational view
1. Research Groups (individual faculty, labs, Labs and Centers): knowledge producers/consumers (social layer)
2. Professional Societies: knowledge aggregators (linking layer)
3. Data Centers: system, data storage expertise (storage, data management layers)
4. Libraries and archives: content/data management, data linking expertise (data management layer)
5. Businesses (Publishers, IT companies): discovery, delivery layers
6. Universities, Funders: business, policy layers

The Data Curation Ecology?
• Publishers: Springer, Nature, BMC, PLoS, WoS
• Scholarly Societies: APS, ACM, Sage Commons
• IT Companies: e.g. Microsoft, Oracle, Mendeley, Zotero
• Institutions: Libraries, IT Centers, Research Admin
• Research Groups
• Funders: Governments, Foundations
• Data Centers: institutional, disciplinary, commercial

Researcher’s Role: Provision
• Metadata (provenance)
• Rights (licenses, open data)

Society’s Role: Collection
e.g. Sage Commons
“Sage Commons is a novel information platform being built by an international partnership of researchers and stakeholders to define the molecular basis of disease and guide the development of effective human therapeutics and diagnostics.”
“The Sage Commons will be used to integrate diverse molecular mega-data sets, to build predictive bionetworks and to offer advanced tools proven to provide unique new insights into human disease biology. Users will also be contributors that advance the knowledge base and tools through their cumulative participation.”

Publisher’s Role: Accreditation
• Require data deposit to archives
• Publish data journals
• Manage peer review (quality control)
• Provide credit for data publishing (evolution of promotion & tenure system)

Data Center Role: Infrastructure
• HPC
• Large-scale storage
• Bit-level preservation
• Large-scale data operations

Funder’s Role: Policy
• Mandates
• Incentives
• Guidelines

IT Companies’ Role: Tools
• Analyze and visualize data
• Search/subset data
• Store/manage data references
• Data integration

Library’s Role: Stewardship
• Data organization and annotation, e.g. ontologies and metadata
• Data archiving and preservation, e.g. perpetual access
• Outreach and support to local researchers
Gabridge, Tracy. The Last Mile: Liaison Roles in Curating Science and Engineering Data. Research Library Issues, no. 265 (Aug 2009). http://www.arl.org/bm~doc/rli-265-gabridge.pdf

Chapter 4 Library case studies: Lessons learned

Institutional/Library Data Curation

University of Chicago Case Study

Sloan Digital Sky Survey Data
• Managed by the Astrophysical Research Consortium (ARC)
• SDSS Project planned for data hand-off, recruited U of C Library
• $700k budget over 5 years; library costs were ~$150 (storage, servers, processing)

Sloan Digital Sky Survey Data
• Permanent archive of the data (DAS)
• Serving the data to the public (CAS)
• Help desk management
• Preserving administrative records

Digital Archive Server
• FITS images, spectra files, catalog tables
    • ~75 Tb of flat files
    • Support for lookup, download
• Preservation plan in development
    • Chicago and Johns Hopkins are mirror sites
    • Chicago committed to preserve in perpetuity

Catalog Archive Server
• Search access to processed data
• 20 Tb table data
    • SQL Server with Web UI
• Library hosts, minimal preservation plans
    • Requires significant domain expertise
    • Ongoing ARC community effort

User Support
• Help desk
    • Referral system to domain expert network
    • Question categories: CAS, DAS, Photometry, Astrometry, Spectroscopy, Publications permissions and policy, Education exercise
    • Partially automated, partially intermediated (goal is 100% automation)

Records Management
• Library’s Special Collections and Records Management
    • Appraised project records, selected print and digital project records, including email archives, websites, procedural manuals, documents, databases (e.g. GNATS db)
    • Transferred, accessioned and processed records using standard archival principles

MIT Case Study

Libraries and Data
Established curation for some data types
• statistical (Harvard-MIT Data Center)
• geospatial (Geodata Repository)
• bioinformatics (via NLM/NCBI)
• digital media (e.g. images, videos)
• general datasets (IR digital archive)

Libraries and Data
Applies to both faculty-authored and externally-acquired data
• Consultation services (in-person, via Website)
• Liaise with domain data centers (e.g. ICPSR)
• Develop (meta)data standards (e.g. DDI)
• Manage and preserve data

Robotics Data in DSpace@MIT
The Library:
• Defined local taxonomy for metadata values
• Customized metadata records
• Adapted/simplified deposit workflow
• Loaded data from previous repository
• Added CC0 licenses
Review of new deposits done by researchers

Neuroimaging Case Study
Sources: Brain & Cognitive Science Department; McGovern Institute for Brain Research; Martinos Imaging Center; Research Lab of Electronics
• Digital images (MRIs, DTIs, VBM, etc.) combined with phenotype and protocol data, genomic data, EEGs, etc.
• Large-scale: >10 Tb per year for one group of 4 faculty
• Expensive: each subject ~$1000 (1500/year, per machine)
• Hard to find, interpret: no standard way to annotate images for sharing, reuse

Biological Oceanography Case Study
Figure: Temperature versus salinity (T-S) relations for the North Pacific Subtropical Gyre at station ALOHA

Biological Oceanography Case Study
Sources: Civil Engineering; Biological Engineering; Earth, Atmospheric and Planetary Sciences
• Metagenomics data combined with biochemical sensor data (water chemistry, optical properties, physical data (e.g. location))
• Large-scale: Solexa sequencer produces 1 Tb per run x 2-3 runs/week
• Irreplaceable: time dependent, not fully analyzable today
• Need to collaborate: no integrated DB exists (e.g. GenBank only takes sequences)
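
As a back-of-the-envelope reading of the "large-scale" figure on this slide, the per-run volume can be turned into an annual estimate. The Python sketch below assumes the sequencer runs every week of the year (52 weeks), which the slide does not state, and keeps the slide's own unit label.

```python
WEEKS_PER_YEAR = 52  # assumption: the sequencer runs every week of the year
TB_PER_RUN = 1       # from the slide: 1 Tb per run


def annual_volume_tb(runs_per_week: int,
                     tb_per_run: int = TB_PER_RUN,
                     weeks: int = WEEKS_PER_YEAR) -> int:
    """Estimated data volume per year for a given number of runs per week."""
    return runs_per_week * tb_per_run * weeks


low = annual_volume_tb(2)   # lower bound: 2 runs/week
high = annual_volume_tb(3)  # upper bound: 3 runs/week
print(f"{low}-{high} Tb per year")
```

Under that assumption a single sequencer would produce on the order of 100 to 150 Tb per year.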

Chapter 5 The Data Ecology Revisited

The Data Curation Ecology?
• Publishers: Springer, Nature, BMC, PLoS, WoS
• Scholarly Societies: APS, ACM, Sage Commons
• IT Companies: e.g. Microsoft, Oracle, Mendeley, Zotero
• Institutions: Libraries, IT Centers, Research Admin
• Research Groups
• Funders: Governments, Foundations
• Data Centers: institutional, disciplinary, commercial

The Institution’s Role in the Data Ecology
• Policy
• Incentives
• Financial sustainability

The Library’s Role in the Data Ecology
• Key role in
    • Defining the service model for data archiving
    • Providing outreach and support to researchers
    • Modeling data and metadata “ontologies”
    • Preserving data over long time frames
• In collaboration with
    • central IT department to manage system, storage, technical support;
    • domain experts for metadata and preservation goals

The Library’s Role in the Data Ecology
Provide Outreach, Support, Education
• Funder/university policy, legal issues
• File management (naming, backup, directories)
• Annotation (metadata, research protocol)
• Long-term archiving (digital preservation)
• Data sharing and citation
• Data integration support
• Data Management Plans

The Library’s Role in the Data Ecology
Why libraries?
• Libraries are interdisciplinary, large-scale
• Libraries know information
• Libraries like supporting researchers
Librarians are a keystone species in information ecologies
Nardi, Bonnie; O’Day, V. (1999). Information Ecologies: Using Technology with Heart. Cambridge: MIT Press. pp. 288. http://firstmonday.org/htbin/cgiwrap/bin/ojs/index.php/fm/article/view/672/582

Lessons Learned (The End)
• Embrace interdisciplinarity
• Examine mission, strength of each “species” organization (in addition to roles, skill sets)
• Work towards a shared definition of the data curation ecology