Information Standards for describing, processing and disseminating data and metadata
Marco Pellegrino, Eurostat, Unit B1: Methodology and corporate architecture
marco.pellegrino@ec.europa.eu
ESTP course, Luxembourg, April 2015
Purpose of this training course
At the end of this session you will:
• Know the basics of the most widely used information models
• Be aware of the SDMX business case and the main steps for implementing SDMX in a statistical domain
• Understand the techniques for identifying the structure of data and metadata in SDMX
• Understand how other standards (DDI, RDF, …) can support various use cases
• Understand how the different standards can fit together
Agenda
0. Opening and Introduction
1. Introduction to Information Models & Standards
2. Overview of implementation-level standards: SDMX
   - SDMX model and components
   - Data and metadata structures, Web services, formats
3. Overview of implementation-level standards: DDI, RDF, VTL, BPMN
4. Conclusion
   - Use cases, Open Forum
   - Course evaluation
Challenges
Where are we?
• Dramatic changes in the environment of official statistics producers (e.g. data deluge)
• Modernization of statistical information systems seen as a question of survival for the official statistics sector
• Standardization viewed as a key enabler for modernization
• "Standards-based" industrialization of statistical production
Strategic responses
• High-level group on the modernisation of statistical production and services (HLG)
  - Generic Statistical Business Process Model (GSBPM)
  - Generic Statistical Information Model (GSIM)
  - Common Statistical Production Architecture (CSPA)
• Sponsorship on Standardisation (SpS)
  - ESS process of standardisation
  - Cost/benefit model for evaluating standardisation efforts
• ESS Vision 2020 Implementation
  - Contains a series of cross-cutting activities aiming at the deployment of new and re-engineered business processes and related standards, ensuring the interoperability of processes and the sharing of information
Standardization
• Why is it necessary?
  - Harmonization
  - Reusability and interoperability
  - Shared solutions across statistical institutes
• What does it imply?
  - Common processes
  - Common tools
  - Common methodologies
The standardisation process
Standardization
• Industry standards
  - GSBPM: Generic Statistical Business Process Model
  - GSIM: Generic Statistical Information Model
  - SDMX: Statistical Data and Metadata eXchange
  - DDI: Data Documentation Initiative
• Other major standards
  - RDF: Resource Description Framework
  - LOD: Linked Open Data
  - JSON: JavaScript Object Notation
  - XBRL: eXtensible Business Reporting Language
Usual complaints against most standards
• Standards are slow
• Designed by a committee
• It's just a bunch of vendor experts
• Standards don't have my feature XYZ
• Standards don't guarantee portability
• Standards don't innovate
• We are different!
Why do we need a model?
• To define and describe statistical processes in a coherent way
• To standardize process terminology
• To compare and benchmark processes within and between organisations
• To identify synergies between processes
• To inform decisions on systems architectures and organisation of resources
Information model
A coherent set of information objects/concepts, e.g. statistical unit, variable, value, question, classification:
• With their definitions
• With relations to other objects
• With attributes
• Used to describe statistical data and metadata
SDMX Information Model
Provides a way of modelling data, metadata and exchange processes
• Structural metadata: the Data Structure Definition (DSD) describes the dataset structure
  - Dimensions (e.g. country, variable/topic, year)
  - Attributes (e.g. unit of measure)
  - Code lists
• Structural metadata identify and describe the data
• Attributes carry metadata about an individual value, a time series or a group of time series
A user-level formal language to:
• express, agree and design information needs
• give specifications to reporting agents
• communicate with IT people
• drive the software (which doesn't change)
• document the system
User autonomy
Flexible information system, evolving fast & cheaply
The four metamodelling levels
• SDMX metamodel
• Data model: concepts, codes, DSD
• Real data (e.g. BOP, ESA/SNA)
A model represents a system and conforms to a metamodel
Generic Statistical Information Model (GSIM): SDMX, DDI, RDF, etc.
GSBPM: Generic Statistical Business Process Model
• Applicable to all activities undertaken by producers of official statistics that result in data outputs
• Used by national and international statistical organisations
• Independent of data source; can be used for:
  - Surveys / censuses
  - Administrative sources / register-based statistics
  - Mixed sources
GSBPM - Phases
The GSIM: Generic Statistical Information Model
• GSIM provides a common language to describe information that supports the whole statistical production process, from the identification of user needs through to the dissemination of statistical products.
• GSIM is a strategic approach designed to bring together statisticians, methodologists and IT specialists to modernize and streamline the production of official statistics. GSIM is aligned with relevant data management and exchange standards, such as DDI and SDMX, but it is not directly tied to them, or to any specific technology.
• GSIM is a reference framework of internationally agreed definitions, attributes and relationships that describe the pieces of information (called "information objects" in GSIM) that are used in the production of official statistics.
The GSIM
GSIM as the reference model!
• GSIM does not provide a standard representation of its own, and is intended to be implemented using existing external standards and models, which support technical implementation. This may involve mapping to internal data models or implementation standards used within an organization. Organizations implementing GSIM will need to map the GSIM objects against their implementation models.
• For some common standard models, GSIM mappings have been provided. The design of GSIM takes into account the possibility of mapping to implementation models, such as SDMX or DDI. Such a mapping can be used to establish a link between GSIM and its technical implementation.
GSIM facilitates the use of multiple standards by providing a reference model against which information standards can be mapped!
Conceptual model: GSIM
Implementation standards: DDI, SDMX, other relevant standards, geospatial standards
SDMX: Statistical Data and Metadata eXchange
UNSD, World Bank, Eurostat
SDMX – Why?
• The exchange of statistical data and metadata is complex, resource intensive and expensive
• In the past, national and international organisations had developed specific approaches and solutions
• Opportunities and challenges related to new technologies for machine-to-machine exchange were coming up, e.g. XML, web services
SDMX IS…
• A model to describe statistical data and metadata
• A standard for automated communication from machine to machine
• A technology supporting standardised IT tools
In order to take advantage of all this:
• Statisticians agree to use a common description for data and metadata
• The data exchange process is then driven by this common description
• Data descriptions are made available to everybody who wants to understand and reuse the data
This is what SDMX provides and enables
The SDMX Components
• Describe statistics in a standard way: objects and their relationships
  - Data Structure Definition (DSD), Concepts, Code List
• Central management and standard access
  - SDMX Registry, SDMX Web Services
• Cross Domain Concepts, Cross Domain Code Lists, Statistical Domains, Metadata Common Vocabulary
• Push: provider generates and sends file to receiver
• Pull: provider opens web service to data; receiver downloads regularly
• Hub: special case of pull; receiver downloads on end user request
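The difference between the push, pull and hub modes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Receiver` and `Hub` classes, the provider names and the data are hypothetical, and a plain function call stands in for an SDMX web service.

```python
from typing import Callable

Dataset = dict[str, int]            # series key -> value (made-up data)
WebService = Callable[[], Dataset]  # stands in for a provider's SDMX web service


class Receiver:
    def __init__(self) -> None:
        self.store: dict[str, Dataset] = {}

    def receive_push(self, provider: str, data: Dataset) -> None:
        """Push: the provider decides when to send; the receiver only stores."""
        self.store[provider] = dict(data)

    def pull_all(self, services: dict[str, WebService]) -> None:
        """Pull: the receiver calls each provider's service, e.g. on a schedule."""
        for provider, service in services.items():
            self.store[provider] = service()


class Hub:
    """Hub: nothing is stored centrally; providers are queried per user request."""

    def __init__(self, services: dict[str, WebService]) -> None:
        self.services = services

    def query(self, providers: list[str]) -> dict[str, Dataset]:
        return {p: self.services[p]() for p in providers}


# Hypothetical providers exposing a "web service" each.
services = {"NSI_LU": lambda: {"POP": 1}, "NSI_IT": lambda: {"POP": 2}}

receiver = Receiver()
receiver.receive_push("NSI_FR", {"POP": 3})  # push from a third provider
receiver.pull_all(services)                  # scheduled pull

hub = Hub(services)
print(sorted(receiver.store))
print(hub.query(["NSI_IT"]))
```

In the hub case the data stay with the providers until a user asks for them, which is why the slide calls it a special case of pull.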
SDMX – From 1.0 to 2.1
• September 2004: Version 1.0
• November 2005: Version 2.0
• February 2008: SDMX accepted at UN level; recognised and supported as the preferred standard
• April 2011: Version 2.1
• GESMES/TS, SDMX-EDI, SDMX-ML, SDMX Registry
All good standards change…
• Danger (1): too much change may discourage adoption
• Danger (2): not giving users the functionalities they want will discourage adoption
Need to find a balance
SDMX
• Describes the structure of aggregate/dimensional data ("structural metadata")
• Provides formats for dimensional data
• Provides a model of data reporting and dissemination
• Provides a way of describing and formatting stand-alone metadata sets ("reference metadata")
• Provides standard registry interfaces, providing a catalogue of resources
• Provides guidelines for deploying standard web services for SDMX resources
• Provides a way of describing statistical processes
DDI
DDI Lifecycle can provide a very detailed set of metadata, covering:
• The study or series of studies
• Many aspects of data collection, including surveys and processing of microdata
• The structure of data files, including hierarchical files and those with complex relationships
• The lifecycle events and archiving of data files and their metadata
• The tabulation and processing of data into tables (NCubes)
• A link between the microdata variables and the resulting aggregates
[Figure: DDI and SDMX mapped onto the Generic Statistical Business Process Model]
GSBPM, DDI and SDMX: towards a complete picture?
Guidelines on SDMX and DDI implementation
An important result of the evaluation work is the invalidation of the following two ideas:
• "DDI is to be used for micro-data and SDMX for aggregates". In reality, both SDMX and DDI can be used for representing some micro-data and aggregated data.
• "DDI is to be used in early phases of the GSBPM, and SDMX in the late phases". In reality, the type of activity does not seem to be the most relevant criterion for selecting an information standard.
The suggestion is that neither the type of production activity nor the level of detail of the data should be considered, on its own, a sufficient criterion for assigning information standards to the data objects manipulated in statistical production.
Use cases (1/2)
• Survey data collection
  - Representation of questionnaires
  - Representation of survey data
• Administrative & register data
  - Representation of register records
  - Transactions on registers
• Reference environment for the combined use of DDI and SDMX
  - Setup of an integrated environment for handling micro-data and further aggregations using the appropriate standard
  - Feasibility of a standard-agnostic metadata repository and registry which could manage metadata compliant with different standards, such as (but not only) SDMX and DDI
Use cases (2/2)
• Statistical micro-data access and on-demand tabulation of micro-data
  - Exploring whether the documentation of, and access to, micro-data made available by Eurostat for research purposes would be improved by the use of DDI
  - Relevance of DDI in the context of remote execution environments
• Metadata and quality reporting
  - Population of ESMS and ESQRS metadata reports with the contents of DDI instances
Standards-based modernisation of statistical production
The challenge
• It's not about which flavour of XML we use (XML doesn't really matter)
• It's about data and metadata!
  - If we are using different standards, how can we ensure that we are reusing the same metadata?
• It's about the convergence of information models and the availability of an integrated IT environment
Data collection
DDI: The Data Documentation Initiative
• DDI is split into 2 branches:
  - DDI-Codebook (DDI-C): a light-weight version of the standard, intended primarily to document simple survey data.
  - DDI-Lifecycle (DDI-L or DDI 3+): designed to document and manage data across the entire life cycle, from conceptualization to data publication and analysis and beyond. DDI-L is currently being evaluated in several statistical organizations across the world.
• The DDI Lifecycle standard provides a data model for describing surveys in a very detailed fashion using XML.
  - This can support many parts of the survey management process, particularly in the case of household surveys, e.g. exchange between question banks and data collection applications, generation of collection instruments, …
DDI and SDMX
• DDI offers a very rich model for the documentation of micro-data
• SDMX offers a very integrated exchange platform for statistical outputs (IT architectures, tools, web services)
The combined use of both standards could allow a higher level of integration of the complete production process
But: the devil is in the detail!
Dissemination: RDF and Linked Open Data
Tim Berners-Lee's classification of five levels of open data:
Dissemination: RDF and Linked Open Data
• Linked data is an approach to publishing data on the web, enabling datasets to be linked together through references to common concepts.
• There are a number of benefits to being able to publish multi-dimensional data, such as statistics, using RDF and the linked data approach:
  - The individual observations, and groups of observations, become (web) addressable. This allows publishers and third parties to annotate and link to this data; for example, a report can reference the specific figures it is based on, allowing for fine-grained provenance trace-back.
  - Data can be flexibly combined across datasets (for example, find all religious schools in census areas with high values for National Indicators pertaining to religious tolerance). The statistical data becomes an integral part of the broader web of linked data.
  - For publishers who currently only offer static files, publishing as linked data offers a flexible, non-proprietary, machine-readable means of publication that supports an out-of-the-box web API for programmatic access.
  - It enables reuse of standardized tools and components.
Dissemination: RDF Data Cube Vocabulary (RDF/QB)
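To make "observations become web addressable" concrete, the sketch below mints a URI for one observation and describes it with terms from the RDF Data Cube vocabulary and its companion SDMX dimension and measure vocabularies. It is a minimal illustration using plain tuples rather than an RDF library; the `http://example.org/census/` base URI, the dataset name, the codes and the value are hypothetical.

```python
QB = "http://purl.org/linked-data/cube#"
SDMX_DIM = "http://purl.org/linked-data/sdmx/2009/dimension#"
SDMX_MEASURE = "http://purl.org/linked-data/sdmx/2009/measure#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical base URI for the publisher's data; not a real endpoint.
BASE = "http://example.org/census/"


def observation_triples(dataset: str, area: str, period: str, sex: str, value: int):
    """Mint a URI for one observation and describe it as (subject, predicate, object) triples."""
    obs = f"{BASE}{dataset}/obs/{area}-{period}-{sex}"
    return [
        (obs, RDF_TYPE, QB + "Observation"),
        (obs, QB + "dataSet", BASE + dataset),
        (obs, SDMX_DIM + "refArea", BASE + "code/area/" + area),
        (obs, SDMX_DIM + "refPeriod", period),
        (obs, SDMX_DIM + "sex", BASE + "code/sex/" + sex),
        (obs, SDMX_MEASURE + "obsValue", value),
    ]


def to_ntriples_line(s, p, o) -> str:
    """Render a triple as an N-Triples-style line: URIs in <>, other values as literals."""
    obj = f"<{o}>" if isinstance(o, str) and o.startswith("http") else f'"{o}"'
    return f"<{s}> <{p}> {obj} ."


# Made-up observation, for illustration only.
triples = observation_triples("marital-status", "LU", "2011", "F", 100)
for t in triples:
    print(to_ntriples_line(*t))
```

Because the observation has its own URI, a report or a third party can link to exactly this figure, which is the provenance benefit described on the slide.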
Current enhancements of SDMX: SDMX-JSON
• JSON (JavaScript Object Notation) is an open standard format that uses human-readable text to transmit data objects consisting of attribute-value pairs. It is used primarily to transmit data between a server and a web application, as an alternative to XML.
• JSON has the same expressivity as XML but is a little more compact and much more direct to use with JavaScript frameworks.
• The updated specification concerning the use of JSON as a standard SDMX representation syntax, next to XML and GESMES/TS, has been released.
SDMX for data validation
• Structural validation: supported
  - Assurance that the structure of the data observations matches the Data Structure Definition, in terms of:
    the concepts used as Dimensions, Measures and Attributes;
    their admissible values (code lists and value constraints)
• Validation of the information content: not yet supported
  - Assurance that data give correct information about the real world, for example:
    completeness / integrity;
    accuracy / plausibility;
    coherence
• Compilation and estimation: not yet supported
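Structural validation can be sketched as follows. This is a minimal illustration, not an SDMX tool: the dictionary layout of `DSD`, the function `validate_structure` and all codes and values are assumptions made for this example.

```python
# Hypothetical, simplified DSD: each component maps to its set of admissible codes
# (None means the component is uncoded, e.g. free text or a number).
DSD = {
    "dimensions": {"GEO": {"LU", "IT", "FR"}, "SEX": {"M", "F", "T"}, "TIME": None},
    "attributes": {"OBS_STATUS": {"A", "P", "E"}},
    "measure": "OBS_VALUE",
}


def validate_structure(obs: dict, dsd: dict) -> list[str]:
    """Return a list of structural errors; an empty list means the observation conforms."""
    errors = []
    known = set(dsd["dimensions"]) | set(dsd["attributes"]) | {dsd["measure"]}

    # Every dimension must be present: dimensions identify the observation.
    for dim in dsd["dimensions"]:
        if dim not in obs:
            errors.append(f"missing dimension {dim}")

    # The measure must be present.
    if dsd["measure"] not in obs:
        errors.append(f"missing measure {dsd['measure']}")

    # No component outside the DSD is allowed.
    for name in obs:
        if name not in known:
            errors.append(f"unknown component {name}")

    # Coded components must take a value from their code list.
    coded = {**dsd["dimensions"], **dsd["attributes"]}
    for name, codes in coded.items():
        if codes is not None and name in obs and obs[name] not in codes:
            errors.append(f"invalid code {obs[name]!r} for {name}")
    return errors


# Made-up observations, for illustration only.
good = {"GEO": "LU", "SEX": "F", "TIME": "2011", "OBS_STATUS": "A", "OBS_VALUE": 100}
bad = {"GEO": "XX", "TIME": "2011", "OBS_VALUE": 1, "COLOUR": "red"}

print(validate_structure(good, DSD))
print(validate_structure(bad, DSD))
```

Note that these checks say nothing about whether the value 100 is plausible; that is the content validation the slide marks as not yet supported.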
VTL: Validation and Transformation Language
Activity led by the SDMX Technical Working Group (TWG) with the participation of other standards communities (DDI, …)
• Some institutions have adopted a language with this aim (e.g. Bank of Italy, Eurostat, UNESCO, …) and use it for internal processing and for exchanging validation and calculation rules with their reporting entities and correspondents.
• Many other institutions are willing to adopt a similar language for the same purpose. Other communities (e.g. DDI, GSIM) are also interested in introducing and sharing a standard validation and calculation language.
• Unless a standard language is introduced, languages of this kind will proliferate.
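The kind of rule such a language is meant to express can be illustrated with a coherence check. The sketch below is written in Python, not in VTL syntax; the rule, the function name `check_sex_totals`, the code `T` for "total" and all figures are assumptions made for this example.

```python
from collections import defaultdict

# Made-up observations: (GEO, TIME, SEX) -> value. "T" is assumed to mean total.
observations = {
    ("LU", "2011", "M"): 40,
    ("LU", "2011", "F"): 60,
    ("LU", "2011", "T"): 100,
    ("IT", "2011", "M"): 45,
    ("IT", "2011", "F"): 50,
    ("IT", "2011", "T"): 100,  # deliberately inconsistent: 45 + 50 != 100
}


def check_sex_totals(obs: dict, tolerance: float = 0.0) -> list[tuple]:
    """Coherence rule: for each (GEO, TIME), M + F must equal T within a tolerance."""
    groups = defaultdict(dict)
    for (geo, time, sex), value in obs.items():
        groups[(geo, time)][sex] = value

    failures = []
    for key, by_sex in sorted(groups.items()):
        # Completeness: the rule can only be evaluated if all three codes are present.
        if not {"M", "F", "T"}.issubset(by_sex):
            failures.append((key, "incomplete"))
        elif abs(by_sex["M"] + by_sex["F"] - by_sex["T"]) > tolerance:
            failures.append((key, "total mismatch"))
    return failures


print(check_sex_totals(observations))
```

A shared rule language would let the collecting organisation send the rule itself, rather than a prose description of it, to its reporting entities.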
SDMX and other standards
References: HLG, SpS, GSBPM, GSIM, CSPA, SDMX official web site, Eurostat SDMX Info Space
BPMN, BPEL, SDMX, VTL, XBRL, DDI, Semantic Web, RDF/QB
So, let me just recap…
Marco.Pellegrino@ec.europa.eu
SDMX IS…
• A model to describe statistical data and metadata
• A standard for automated communication from machine to machine
• A technology supporting standardised IT tools
In order to take advantage of all this:
• Statisticians agree to use a common description for data and metadata
• The data exchange process is then driven by this common description
• Data descriptions are made available to everybody who wants to understand and reuse the data
This is what SDMX provides and enables
Use of cross-domain concepts
Concepts play roles in a Data Structure
– Dimension: concepts that identify the observation value
– Attribute: concepts that add metadata about the observation value (as a value or the context of the value)
– Measure: concept that is the observation value
– Representation: any of these may be coded, text, date/time, number, etc.
Data Structure Definition: Concept Roles
• Country (Dimension)
• Stock/Flow (Dimension)
• Topic (Dimension)
• Time/Frequency (Dimension)
• Unit Multiplier (Attribute)
• Unit of measure (Attribute)
• Observation (Measure)
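The roles on this slide can be written down as a small data structure. The sketch below is illustrative only: the `Component` class, the code list names and the sample record are hypothetical, while the concepts and their roles are taken from the slide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Component:
    concept: str                    # the concept, e.g. "Country"
    role: str                       # "dimension", "attribute" or "measure"
    codelist: Optional[str] = None  # None means an uncoded representation


# Roles follow the slide; the code list names are hypothetical.
DSD = [
    Component("Country", "dimension", "CL_AREA"),
    Component("Stock/Flow", "dimension", "CL_STOCK_FLOW"),
    Component("Topic", "dimension", "CL_TOPIC"),
    Component("Time/Frequency", "dimension"),
    Component("Unit multiplier", "attribute", "CL_UNIT_MULT"),
    Component("Unit of measure", "attribute", "CL_UNIT"),
    Component("Observation", "measure"),
]


def split_observation(record: dict) -> tuple[tuple, dict, object]:
    """Split a flat record into (key, attributes, value) using the concept roles."""
    key = tuple(record[c.concept] for c in DSD if c.role == "dimension")
    attrs = {c.concept: record[c.concept] for c in DSD if c.role == "attribute"}
    (measure,) = [c for c in DSD if c.role == "measure"]
    return key, attrs, record[measure.concept]


# Made-up record, for illustration only.
record = {
    "Country": "LU", "Stock/Flow": "FLOW", "Topic": "EXPORTS", "Time/Frequency": "2014",
    "Unit multiplier": "6", "Unit of measure": "EUR", "Observation": 123.4,
}
key, attrs, value = split_observation(record)
print(key)
print(attrs)
print(value)
```

The dimension values together form the key that identifies the observation; the attributes only qualify it.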
Benefits
1. SDMX provides support for things that are essential to statisticians but are often difficult for them to achieve
2. International standard for holding all of the elements involved in the statistical process together in a clear information model
3. Approach that maximises the amount of information on the statistical context that can be passed through to users, and the capacity to link statistics from different sources
4. Automation of processes: SDMX enables the development of common tools that can be used by all statistical organisations to improve their activities
Benefits (2)
• SDMX is also an advanced standard for data discovery using web-based services
• Web services enable query, visualisation, and automated loading of data and metadata
• SDMX tools allow querying a database, or a file system, for the creation of tables, charts, and graphs from the results of the query
[Figure: statistical organisations connected through the SDMX Reference Infrastructure]
SDMX Architecture (push mode) EDAMIS
SDMX Architecture (pull mode)
Hub approach: PULL method for data collection and dissemination
[Figure: architecture diagram. PULL path: SDMX Registry (register, query), NSI Hub, Eurostat Pull Requestor, received data in SDMX-ML, Loader, Eurobase, Dissemination. PUSH path: eDAMIS, data input, verification / conversion to SDMX, intermediate storage, XSL for SDMX-ML, warehouse storage, dissemination.]
The European Census Hub: key issues
• Dissemination of the data from the 2011 population and housing censuses in the European Union
• Data that are methodologically comparable and structured according to "hypercubes" agreed with Member States (Census Regulation)
• Providing users with easy access to detailed census data (advanced functionalities)
• Management of massive amounts of data produced and controlled by Member States
• High accessibility of data and metadata; easy to use
• Harmonised concepts and definitions
• Maximum flexibility to cross-tabulate data from different sources
The Solution!
• Data sharing
• SDMX format
• Pull approach and web services, independent from NSIs' systems
• Limited investment, re-usability (with the advantage of using recognized international standards)
• Decoupling of NSIs' systems from the central hub via standard formats and techniques for the exchange
• Free of charge
Delivered:
• Census Hub, LIVE at https://ec.europa.eu/CensusHub2
• SDMX-Reference Infrastructure (SDMX-RI)
Example of DSD for Table 6 (Marital Status)
Dimensions
ID          CONCEPT                   CODELIST
TIME        Time period or range      CL_TIME
GEO         Geographical area         CL_GEO
SEX         Sex                       CL_SEX
FST         Family status             CL_FST
LMS         Legal marital status      CL_LMS
CAS         Current activity status   CL_CAS
POB         Country/place of birth    CL_POB
COC         Country of citizenship    CL_COC
AGE         Age                       CL_AGE
FREQ        Frequency                 CL_FREQ
Attributes
ID          ATTACHMENT LEVEL          CODELIST
OBS_STATUS  Observation               CL_OBS_STATUS
OBS_LEVEL   Observation               CL_OBS_LEVEL
OBS_NOTE    Observation
HC_NOTE     Series
Measures
ID          NAME
OBS_VALUE   Observation value
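One practical use of a DSD like this is building the key of a data query. In the SDMX RESTful API a series key is written as the dimension values in DSD order, separated by dots; an empty position acts as a wildcard and `+` joins alternative values. The sketch below assumes the dimension order listed on the slide and assumes that TIME is filtered separately rather than placed in the key; the function `build_key` and the codes used are hypothetical.

```python
# Dimension order assumed from the slide; TIME is left out of the key on the
# assumption that the time period is filtered separately.
KEY_DIMENSIONS = ["GEO", "SEX", "FST", "LMS", "CAS", "POB", "COC", "AGE", "FREQ"]


def build_key(selection: dict) -> str:
    """Build a dot-separated series key; unselected dimensions become wildcards."""
    unknown = set(selection) - set(KEY_DIMENSIONS)
    if unknown:
        raise ValueError(f"not a dimension of this DSD: {sorted(unknown)}")
    parts = []
    for dim in KEY_DIMENSIONS:
        codes = selection.get(dim, [])
        parts.append("+".join(codes))  # empty string = wildcard, "+" = alternatives
    return ".".join(parts)


# Hypothetical codes, chosen only for illustration.
key = build_key({"GEO": ["LU", "IT"], "SEX": ["F"], "FREQ": ["A"]})
print(key)
```

The position of each value in the key carries its meaning, which is why the order of dimensions in the DSD has to be agreed and stable.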
How the Hub works Eurostat Census Hub National Statistical Institute
https://ec.europa.eu/CensusHub2/
Census Hub - Standardisation of the approach
[Figure: Data Provider = NSI (standard, exchange, metadata, repository). Components on the NSI side: non-SDMX local database, Mapping store, SDMX Mapping Assistant, DSD, Metadata repository, web services, Test Client, Web Client. The Census Hub (Data Collector) sends an SDMX query and receives an SDMX response.]
Through the SDMX Hub, a data user can…
• Browse the Hub to define a dataset of interest, navigating via structural metadata:
  - Search by topic (filters) and select data (level of detail, breakdowns)
  - Select layout (axes)
• View a table
• Export a file (CSV, Excel, SDMX-ML)
• Register by providing a valid e-mail address:
  - The system generates a random password
  - The user is notified of the password by e-mail
• Once registered:
  - Save, retrieve, modify or delete stored queries
  - Receive an e-mail notification when offline queries are executed
The SDMX Implementation Process
How to set up a project
Use case: National accounts
Eurostat Directorate C: National Accounts, prices and key indicators
PRACTICE
The SDMX implementation process can be broken down into key phases:
- Phase 1: Preparation
- Phase 2: Compliance
- Phase 3: Implementation
- Phase 4: Production
'By failing to prepare, you are preparing to fail.' Benjamin Franklin
THEORY
The 'Preparation Phase' is arguably the most critical phase.
- It determines the objectives, scope, expected benefits and outputs of the project.
- It requires a thorough discussion and assessment of needs, of stakeholders (and their roles) and of the business process.
- It bears in mind policies, mandates, technological changes and different conceptual languages.
THEORY
At the end of the preparatory phase, all stakeholders should be clear about:
- The goals of the project;
- The timeline for implementation;
- A draft of the project plan; and
- Their roles and responsibilities.
THEORY
This is the phase of the project during which:
- The groundwork is carried out;
- The system is designed; and
- The workflow is sequenced.
It is the phase during which a high investment of time is likely to be required, in part because loops back into the preparation tasks are likely.
PRACTICE
The following steps are taken in the compliance phase:
- Analysis of the current data exchange;
- Decision on the appropriate structure for the exchanged data;
- Decision on DSDs to use (e.g. global DSDs), or creation of new DSDs covering the data exchange;
- Development of draft maintenance agreements.
THEORY
This is the phase of the project during which:
- Plans are implemented;
- Communication with stakeholders is emphasised;
- Problems are revealed, solutions sought and corrective action taken.
PRACTICE
In the implementation phase, the following steps are taken:
- Draw up an implementation timeline;
- Finalise DSDs and maintenance agreement;
- Set up an appropriate IT infrastructure for SDMX-compliant transmission;
- Carry out pilot projects to review the DSD structure and test the infrastructure.
In the production phase, the following steps are taken:
- SDMX-compliant data exchange used in production;
- Maintenance of the SDMX artefacts at regular intervals.
Eurostat SDMX Implementation Process
National Accounts BOP FDI GFS Main Aggregates Sector Accounts Supply & Use SDMX Input / Output
Ownership Group (OG)
• Mandated by SDMX Sponsors in January 2014
• As a single OG for NA & GFS, BOP and FDI
• Representation of management from content and SDMX side
• Oversees maintenance of SDMX artefacts
• Mandated technical groups
  - 1 on BOP and FDI; 1 on NA and GFS
• Close cooperation with IAG Task Force on international data cooperation (TFIDC)
SDMX-NA Governance
• Maintenance Agency: Eurostat
• Ownership Group:
  - Active: ECB, Eurostat, IMF, OECD, UNSD
  - Passive: BIS, World Bank
• Published in the global registry: http://registry.sdmx.org
• Annual maintenance cycle (April-April)
• SDMX-NA maintenance page: http://sdmx.org/?page_id=1498
A possible data sharing model
[Figure: users (IMF, UN, WB, BIS, other IOs) access data through SDMX web services provided by OECD, Eurostat and ECB, covering the EU, OECD other than EU, and World other than OECD]
The necessary pre-conditions
• SDMX maintenance agreements: ownership and maintenance agency to be defined (normally between the international organisations)
• Internationally agreed data validation: data validation rules and SDMX data validation standards (beyond file structure validation)
• Streamlined data exchange processes: frequency and timeliness for data dissemination, data quality management, etc.
How to model a statistical domain
[Figure: DSD matrix relating Concepts 1 to 6 of a Concept Scheme and their code lists (CL_DIM1 to CL_DIM4, CL_ATT5, CL_ATT6) to Data Flows (Flow A, Flow B, … Flow N) and the N Data Structure Definitions; also shown: Code Lists, other Concept Schemes, Structure Set, Cube Region Constraints]
SDMX Information Model - Summary Reference DSD Data Flow Reference Concept Scheme Constraint Reference Code lists
Steps to model a statistical domain
1. Agree on data exchange needs
2. Define Concept Scheme
3. Define Code Lists
4. DSD Matrix: map Concepts to Dataflows
5. Optimise matrix to reduce number of DSDs
6. Derive SDMX artefacts
If DSDs already exist: use them (National Accounts, BOP, FDI)
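Steps 4 and 5 can be illustrated with a toy DSD matrix. The sketch below is one possible reading of "optimise", not a prescribed method: dataflows with identical concept sets share a DSD, and a dataflow whose concepts are a strict subset of another's is folded into it on the assumption that the extra concepts can be fixed or left empty. The dataflow names, concept names and functions are hypothetical.

```python
from collections import defaultdict

# Hypothetical DSD matrix: for each dataflow, the concepts it needs.
matrix = {
    "FLOW_A": {"FREQ", "REF_AREA", "INDICATOR", "UNIT"},
    "FLOW_B": {"FREQ", "REF_AREA", "INDICATOR", "UNIT"},
    "FLOW_C": {"FREQ", "REF_AREA", "SECTOR", "UNIT"},
    "FLOW_D": {"FREQ", "REF_AREA", "SECTOR"},
}


def group_identical(matrix: dict) -> dict:
    """Simplest optimisation: dataflows with identical concept sets share one DSD."""
    groups = defaultdict(list)
    for flow, concepts in sorted(matrix.items()):
        groups[frozenset(concepts)].append(flow)
    return dict(groups)


def merge_subsets(groups: dict) -> dict:
    """Further step: fold a group into another whose concepts are a strict superset."""
    merged = {}
    # Largest concept sets first, so smaller ones can be absorbed.
    for concepts, flows in sorted(groups.items(), key=lambda kv: (-len(kv[0]), kv[1])):
        host = next((c for c in merged if concepts < c), None)
        if host is None:
            merged[concepts] = list(flows)
        else:
            merged[host].extend(flows)
    return merged


identical = group_identical(matrix)
final = merge_subsets(identical)
for concepts, flows in final.items():
    print(sorted(concepts), "->", sorted(flows))
```

Here four dataflows end up covered by two DSDs. Whether folding a subset into a larger DSD is acceptable is a design decision for the domain experts, since it trades fewer artefacts for some unused components.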
SDMX evolution
• September 2004: Version 1.0
• November 2005: Version 2.0
• February 2008: SDMX accepted at UN level; recognised and supported as the preferred standard
• April 2011: Version 2.1
• GESMES/TS, SDMX-EDI, SDMX-ML, SDMX Registry
Some new features being considered
• VTL validation rules (rules, queries, etc.)
• Better support for multiple measures
• Versioning of codes: time validity
• Referencing HCL, extending code lists
• Disabling dimensions
• Describing non-SDMX formats
• Geographical referencing (GIS)
• …
What about alignment to GSIM? SDMX-DDI convergence?
SDMX Technical Working Group