The CCLRC Metadata Model Shoaib Sufi Brian Matthews
Overview
• Motivation behind the CSMD
• A look at the components of the model
• Usage of the model
• Future directions
Who we are (CLRC)
• Central Laboratory of the Research Councils
• 1700 staff, supporting 12000 scientists and engineers from universities and industry
• Based at 3 sites:
  – Daresbury Laboratory
  – Rutherford Appleton Laboratory
  – Chilbolton Observatory
• A Multidisciplinary Laboratory
A Multidisciplinary Laboratory
• Spallation Neutron and Muon Source (ISIS)
• Synchrotron Radiation Source (SRS)
• Lasers
• Microstructures
• Space Science and Technology
• Molecular Spectroscopy
• Earth Observation
• Atmospheric Science
• Computational Science
• Energy Research
• Information Technology
• Particle Physics
• Radio Communications
• Surfaces Transforms and Interfaces
The Problem
• Scientific institutions generate vast quantities of data
  – CLRC: ISIS, SRS, Space Science, Particle Physics, Computational Science, ...
• More data coming on stream all the time:
  – CERN-LHC, Diamond, NGLS, ...
• Very good at handling large amounts of data
• Diverse approaches to organising and distributing it
  – Data Curation now becoming common
Need a usable way of gaining access to the data
User Scenarios
• Lecturer:
  – This published study would be a good example for teaching. Is the raw data publicly available?
• Researcher:
  – This is an interesting paper. Can I check the data?
• Experiment Proposer:
  – Have there been any neutron or X-Ray studies of this molecule at 100 K? What reports and papers have been published on them?
• Instrument Scientist:
  – The instrument seems a bit unstable recently. Fetch me the results of all calibration runs from the last 3 months.
The Data Portal Concept
• Single method of access to the CLRC data resources
• Encompasses a wide range of data holdings
  – Describes what data is available from the facilities
  – Links to the data held at the facility
  – Different archiving methods
• Caters for a wide range of users
  – from the general community to data curators
• Supports a wide range of queries
  – employing data mining, thesauri, ...
Combine Diverse Users & Searches...
[Figure: user types (data curator, experimenter, specialist user, wider science community, general community) and search types (Excavation, Discovery)]
… with Distributed Data Silos...
[Figure: Facility 1, Facility 2, Facility 3, Facility 4]
… using a central common metadata index...
[Figure labels: Client; http; XML wrapper; Common metadata catalogue database; CLRC Data Access Server; XML wrapper; Local metadata; Local data; Facility 1]
… and a Web based interface
• Exploit the existing Web infrastructure.
  – Use Web Technologies (XML/RDF);
  – rapidly disseminated;
  – widely accessible;
  – database and user platform independent
  – also being deployed onto the GRID
Every user who needs to can get to the information.
Metadata
A generic metadata model for all scientific applications, with specialisation for each domain
[Figure: the Science Metadata Model with domain specialisations labelled ISIS, SRS, HEP, Social, Space, Env. Science]
• Can answer questions across domains
• Can answer questions about specific domains
Model Motivation
• A common general format/standard for Scientific Studies and data holdings metadata does not exist
• Metadata workshop at NIEES 2002, during a discussion on metadata standards
  – are people capturing metadata at the moment?
  – simple answer given was no!
• By proposing a Model and Implementation:
  – Form a specification for the types of metadata that should be captured by Scientific Studies
  – Ease citation, collaboration, exploitation and integration
  – Allow easy integration of distributed heterogeneous metadata systems into a homogeneous (albeit virtual) platform
• Therefore
  – The CCLRC Scientific Metadata Model (CSMD) was developed.
Some Model Aims
• Abstract, class-orientated description of the types of metadata that should be captured by Scientific Studies
• Create a common denominator for Scientific Study metadata which forms a specification
• Provide representations in XML, RDF etc.
What influenced CSMD?
• CIP from Earth Observation
• DDI from Social Sciences
• Dublin Core from the Library community
  – Publication only metadata
• CERIF
  – research project information
• XSIL as used on LIGO
  – Low level ‘Scientific Data Objects’ focus
• CERA from the MPIM
  – A bit specific to Earth Sciences but closer
• … hence the need to develop our own General Model
Metadata Model Structure
• The CCLRC Scientific Metadata Model (CSMD) is a study/data set orientated model holding study information about:
  – Topic Indexing
    • Keywords
    • Taxonomies
  – Provenance
    • What the study is, who did it and when
  – Data Holding
    • Detailed description of what the data is, and a hierarchical data organisation
    • Parameter information
    • Atomic Data Objects can refer to:
      – Regular files
      – Database selects
    • Some Workflow elements
  – Legal notes
    • Copyright, patents and conditions of use etc. relating to the study and the data in the study
  – Related Material
    • Publications, Community information and related links
  – (Access Conditions)
[Figure: entity diagram of the Metadata Granule, with entities Study, Investigation, Topic, Data Holding, Access Conditions, Related Material, Data Collection, Atomic Data Object and Legal Note, joined by one-to-one (1–1) and one-to-many (1–M) relationships]
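The structure above can be sketched as a small set of Python classes. This is only an illustrative reading of the slide, not the CSMD specification: the class and field names, the `count_objects` helper and all sample values are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AtomicDataObject:
    name: str
    locator: str  # assumed: a regular file path/URL or a named database select


@dataclass
class DataCollection:
    name: str
    objects: List[AtomicDataObject] = field(default_factory=list)
    sub_collections: List["DataCollection"] = field(default_factory=list)


@dataclass
class Investigation:
    name: str
    investigation_type: str  # e.g. experiment, simulation, measurement
    data_holding: List[DataCollection] = field(default_factory=list)


@dataclass
class Study:
    name: str
    institution: str
    investigator: str
    keywords: List[str] = field(default_factory=list)          # topic indexing
    investigations: List[Investigation] = field(default_factory=list)
    legal_notes: List[str] = field(default_factory=list)
    related_material: List[str] = field(default_factory=list)


def count_objects(collection: DataCollection) -> int:
    """Count atomic data objects in a collection, including nested collections."""
    nested = sum(count_objects(c) for c in collection.sub_collections)
    return len(collection.objects) + nested


# Hypothetical sample data, for illustration only.
raw = DataCollection("raw", objects=[AtomicDataObject("run1", "file:///data/run1.dat")])
final = DataCollection("final", objects=[AtomicDataObject("result", "file:///data/result.dat")])
holding = DataCollection("holding", sub_collections=[raw, final])
study = Study(
    name="Example study",
    institution="Example University",
    investigator="A. Researcher",
    keywords=["chemistry"],
    investigations=[Investigation("Experiment 1", "experiment", [holding])],
)
```

Nesting collections inside collections is one way to model the hierarchical data organisation; a relational or XML Schema implementation would express the same relationships differently.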
Model Breakdown: Provenance
• The Study contains the following metadata:
  – The Study Name
  – The Study Institution
  – The Investigator
  – Extended Study Information
    • Abstract
    • Funding
    • Start and End times
  – Investigations
Investigations
• A Study can have more than one investigation:
  – experiment,
  – simulation,
  – measurements etc.
• Investigations contain:
  – Name
  – Investigation Type
  – Abstract
  – Resource
  – Link to Data Holding
Topic (for indexing)
• Keywords
  – Discipline (i.e. domain)
  – Keyword Source (e.g. domain dictionary)
  – Keyword
• Subjects
  – Discipline
  – Subject Source (e.g. domain taxonomy)
  – Subject
Access Condition & Related Material
• Access Conditions
  – Contains a list of users or groups who are allowed access to the metadata and data, or a pointer to an access control system which contains such data for this study
• Related Material
  – One or many links and/or textual descriptions of material related to this study, e.g. earlier studies or parallel studies
Hierarchy of Data Holdings
• Investigations have associated data holdings.
• These are themselves arranged in a hierarchy: data sets and files, with links between them
• Logical organisation – identity separated from location.
[Figure: an Investigation's Data Holding containing Data-Set 1 (Raw), Data-Set 2 (Inter) and Data-Set 3 (Final), each with a File 1 carrying name and date]
Data
• Data Description holds a logical description of the Study’s data:
  – Data Name
  – Type of Data
  – Status
  – Data Topic
  – Parameters
  – Related Data Ref
  – Relation type (e.g. derived)
• Data Location contains the link between logical names (e.g. URIs) and physical URLs
  – Data Name
  – Locator(s) (in the case of Atomic Data Objects these can refer to files as well as named Selects on a database, i.e. virtual data objects)
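The separation of logical name from physical location might look like the following minimal sketch. `DataLocation`, `LocationIndex` and the sample URIs are hypothetical names chosen for illustration; the slide does not prescribe any particular API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DataLocation:
    data_name: str                                      # logical name, e.g. a URI
    locators: List[str] = field(default_factory=list)   # physical URLs or named selects


class LocationIndex:
    """Maps each logical data name to every known physical locator."""

    def __init__(self) -> None:
        self._by_name: Dict[str, DataLocation] = {}

    def register(self, data_name: str, locator: str) -> None:
        location = self._by_name.setdefault(data_name, DataLocation(data_name))
        if locator not in location.locators:  # ignore duplicate registrations
            location.locators.append(locator)

    def resolve(self, data_name: str) -> List[str]:
        location = self._by_name.get(data_name)
        return list(location.locators) if location else []


# Hypothetical names: one logical object held as a file and as a mirror copy.
index = LocationIndex()
index.register("urn:example:study1/raw/run1", "ftp://archive.example.org/run1.dat")
index.register("urn:example:study1/raw/run1", "http://mirror.example.org/run1.dat")
```

Because clients refer only to the logical name, a data object can gain or lose copies without its description changing.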
More on Parameters
• Parameters contain a lot of information about the atomic data objects (ADOs) and collections
• A collection/ADO can have many parameter entries; each parameter entry contains:
  – Parameter derivation (e.g. measured/fixed)
  – The value
  – The units
  – Range
  – Error margin
• Parameter aggregation is also supported
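A parameter entry and one possible form of aggregation can be sketched as below. The slide does not say how aggregation works, so the `aggregate` function (mean value plus observed range over same-named parameters) is purely an assumption, as are the field names and sample values.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Parameter:
    name: str
    derivation: str                                   # e.g. "measured" or "fixed"
    value: float
    units: str
    value_range: Optional[Tuple[float, float]] = None
    error_margin: Optional[float] = None


def aggregate(params: List[Parameter]) -> Parameter:
    """Summarise same-named parameters of several data objects as one entry.

    Assumed semantics: the aggregate carries the mean value and the
    minimum/maximum observed values as its range.
    """
    if not params:
        raise ValueError("nothing to aggregate")
    first = params[0]
    if any(p.name != first.name or p.units != first.units for p in params):
        raise ValueError("parameters must share name and units")
    values = [p.value for p in params]
    return Parameter(
        name=first.name,
        derivation="aggregated",
        value=sum(values) / len(values),
        units=first.units,
        value_range=(min(values), max(values)),
    )


# Hypothetical temperatures recorded for three data objects in one collection.
temperatures = [
    Parameter("temperature", "measured", t, "K") for t in (100.0, 150.0, 200.0)
]
summary = aggregate(temperatures)
```

A collection-level summary of this kind would let a client answer a question such as "any studies at 100 K?" without opening every data object.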
Enumeration Issues
• Enumerations (or controlled vocabularies), e.g. types of investigator or types of institution, are distinct from the model, as taxonomies are.
• However, they are necessary for the model to work, so implementations (e.g. the CCLRC DataPortal XML implementation of the model) propose some enumerations for common things
• It is hoped that implementations will use recognised and relevant controlled vocabularies where they are available
• Developing an RDF-based controlled vocabulary format: SKOS
Conformance Level
• A complete CSMD record requires a large amount of metadata
  – Divide into conformance levels
  – Model defines 5 levels
  – Each level specifies more metadata
• L1 is similar to library/publication style metadata (e.g. Dublin Core)
• Current DataPortal uses somewhere between L2 and L3 – has parameter information
• Envisaged that only new systems designed with CSMD will conform to L4+
• The higher the conformance to the CSMD, the richer clients can be
  – e.g. identifying datasets and atomic data objects which link directly to keywords/taxonomies and not just studies
The five levels:
1. Study and Investigation metadata with indexing at the Study level
2. Level 1 + DataHolding metadata (i.e. DataSets and DataObjects)
3. Level 2 + related material, access conditions, indexing to data collection levels
4. Level 3 + indexing to data object level and data object parameter information
5. All metadata components are filled as L4 + funding, resources used, facilities used etc.
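Because each level builds on the one before, a record's conformance level can be computed by checking the levels in order. The sketch below assumes a record is summarised as a set of component names; those names are invented for the example, and the level 5 set is incomplete since the slide ends its list with "etc".

```python
from typing import Iterable

# One set of required components per level, in order. Names are illustrative.
LEVEL_REQUIREMENTS = [
    {"study", "investigation", "study_indexing"},                      # level 1
    {"data_holding"},                                                  # level 2
    {"related_material", "access_conditions", "collection_indexing"},  # level 3
    {"object_indexing", "object_parameters"},                          # level 4
    {"funding", "resources_used", "facilities_used"},                  # level 5 (partial)
]


def conformance_level(present: Iterable[str]) -> int:
    """Return the highest level whose requirements, and all lower ones, are met."""
    available = set(present)
    level = 0
    for required in LEVEL_REQUIREMENTS:
        if not required <= available:
            break
        level += 1
    return level


# A record with data holdings and parameter information but no related
# material or access conditions: it reaches level 2 and has part of level 4.
partial_record = {
    "study", "investigation", "study_indexing",
    "data_holding", "object_parameters",
}
```

The partial record illustrates the "somewhere between L2 and L3" situation: extra components from a higher level do not raise the level until every lower requirement is satisfied.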
Metadata example
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE CLRCMetadata SYSTEM "clrcmetadata.dtd">
<CLRCMetadata><MetadataRecord metadataID="N000001">
<Topic>
  <Discipline>Chemistry</Discipline>
  <Subject>Crystal Structure</Subject>
  <Subject>Copper</Subject>...
<Experiment>
  <StudyName>Crystal Structure: Copper : Palladium: : complex: 150 K...
  <Investigator><Name><Surname>Porter...
  <Institution>University of Peebles...
  <Funding>EPSRC...
  <TimePeriod><StartDate><Date>21/04/1999...
  <Purpose><Abstract>To study the structure of Copper and Palladium co-ordination complexes at 150 K.
  <DataManager><Name><Surname>Teat...
  <Instrument>SRS Station 9.8, BRUKER AXS SMART 1K...
  <Condition>...Wavelength...<Units>Angstrom...<ParamValue>0.6890...
  <Condition>...Crystal-to-detector distance<Units>cm...<ParamValue>5.00...
<AccessConditions>The user has to be one of: Prof. F. Porter...
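A client could read the topic indexing out of such a record with a standard XML parser. The sketch below uses a small, complete fragment written for this example (the slide's record is elided and cannot be parsed as shown); element names follow the example above, written here as MetadataRecord, Topic, Discipline and Subject, and the attribute name metadataID is assumed.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment modelled on the Topic part of the example.
SAMPLE = """<CLRCMetadata>
  <MetadataRecord metadataID="N000001">
    <Topic>
      <Discipline>Chemistry</Discipline>
      <Subject>Crystal Structure</Subject>
      <Subject>Copper</Subject>
    </Topic>
  </MetadataRecord>
</CLRCMetadata>"""


def topics_by_record(xml_text: str) -> dict:
    """Return {record id: {"disciplines": [...], "subjects": [...]}}."""
    root = ET.fromstring(xml_text)
    result = {}
    for record in root.iter("MetadataRecord"):
        topic = record.find("Topic")
        if topic is None:
            continue  # record carries no topic indexing
        result[record.get("metadataID")] = {
            "disciplines": [e.text for e in topic.findall("Discipline")],
            "subjects": [e.text for e in topic.findall("Subject")],
        }
    return result


topics = topics_by_record(SAMPLE)
```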
CCLRC Scientific Metadata Model Usage
• Used on many projects
  – CCLRC DataPortal
    • XML Schema Implementation
    • Serving data from
      – MPIM (Max-Planck-Institut für Meteorologie, Hamburg)
      – CCLRC Facilities (ISIS, SR, BADC-test)
  – NERC funded ‘Environment from the Molecular’
    • Mini-Grid DataPortal Transport Layer
  – EPSRC funded ‘Simulation of complex materials’
    • Mini-Grid DataPortal Transport Layer
CSMD Used on DataPortal
• XML Implementation used as Data Interface for DataPortal
• Single view of heterogeneous systems/schemas
• Acts as a stress test of the model
  – Limitations feed into Model Requirements
  – New requirements fed back into implementation
Example Result of searching: search across facilities - returns XML to session and displays summary
Expand Results - give more details from the same XML
Going Deeper - Can browse the data sets
Select data - pick the required data files and download from convenient location.
More Usage
• EPSRC funded MyGrid BioInformatics project
  – information model based on version 1 of the CSMD Model
• This is being taken forward in the EPSRC funded myIB project
  – Application to Integrated biology
  – Extending to provenance tracking in computational steering
• ISIS ICAT 20 year back catalog Relational Schema based on version 2
• EPSRC CCP1 (Collaborative Computational Project in Quantum Chemistry)
  – assessing CSMD for metadata needs on their Grid Data Management Middleware
• JISC funded Claddier project
  – Integration of data and publications to link the outputs of scientific projects.
Access to Data and Publications
• The Data Portal offers the potential to integrate the outputs of scientific research: data and publications.
• Need to have a common search mechanism over library and data portals.
  – Can abstract the science metadata to Dublin Core.
  – Relate components of the CSMD model to the FRBR model.
  – Links to CERIF would further deepen connection.
  – Access to common thesauri for classification.
• Common web service interface
  – Data Portal provides this.
  – XML Query as a communication mechanism
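Abstracting science metadata to Dublin Core amounts to projecting a rich study record onto a few general elements. The following is a sketch under stated assumptions: the study field names and the choice of which field maps to which Dublin Core element are illustrative and not taken from the CSMD documentation; only the element names (title, creator, publisher, description, subject, date, rights, relation) are standard Dublin Core.

```python
from typing import Dict, List

# Assumed mapping from study-level fields to Dublin Core element names.
STUDY_TO_DC = {
    "study_name": "title",
    "investigator": "creator",
    "institution": "publisher",
    "abstract": "description",
    "keywords": "subject",
    "start_date": "date",
    "legal_note": "rights",
    "related_material": "relation",
}


def to_dublin_core(study: dict) -> Dict[str, List[str]]:
    """Project a study record onto Dublin Core; unmapped fields are dropped."""
    record: Dict[str, List[str]] = {}
    for field_name, element in STUDY_TO_DC.items():
        value = study.get(field_name)
        if not value:
            continue  # skip missing or empty fields
        # Dublin Core elements are repeatable, so always store a list.
        record[element] = list(value) if isinstance(value, (list, tuple)) else [value]
    return record


# Hypothetical study record, for illustration only.
example_study = {
    "study_name": "Example study",
    "investigator": "A. Researcher",
    "keywords": ["chemistry", "crystal structure"],
    "instrument": "Example instrument",  # no Dublin Core counterpart here
}
dc_record = to_dublin_core(example_study)
```

The projection is lossy by design: domain detail such as instruments and parameters has no place in the general record, which is what makes a common search over library and data portals feasible.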
The Future
• Further collaboration (e.g. NeSSi consortium; SNS in the US and JPARC in Japan), addition of Workflow elements and integration of more internal data repositories should generate more requirements
  – May necessitate a core model with extensions
• Increased use/recommendation of controlled vocabularies and formal identification systems.
• Feeding in relevant ideas from other standards
• The model is expressed as an Object Model; there are plans to move this to an Ontology
  – However, for efficiency reasons, Relational and XML Schema based implementations are still set to be the most common
  – Triple Stores are just not efficient enough (complexity of additions increases with data size, as all implicit information is calculated), but this is an area of development
Finally
• Science metadata model proving very robust
• Trying to extend its use into other areas of science
  – materials science, environmental science.
• Broadening range
  – provenance + computational steering
  – Links to publication data.
• Latest Model description: http://epubs.cclrc.ac.uk/work-details?w=30324
• The above link contains an XML implementation; for a Relational Implementation e-mail:
  – dataportal@dl.ac.uk with the subject containing [metadata model request]
• Data Portal Project
  – http://www.escience.cclrc.ac.uk/web/projects/dataportal