VTL Validation and Transformation Language IPA training course

  • Slides: 18
Download presentation
VTL: Validation and Transformation Language IPA training course Item 4 Luxembourg, 18 May 2017

VTL: Validation and Transformation Language IPA training course Item 4 Luxembourg, 18 May 2017 Maurizio. Capaccioli@ec. europa. eu Eurostat, Unit B 1

VTL: the origin • Implementation of the SDMX information model package on "Transformations and

VTL: the origin • Implementation of the SDMX information model package on "Transformations and Expressions" • Based on a generic information model that can be used with different standards: SDMX, DDI, GSIM or others • VTL is maintained by the VTL Task Force, composed of members of Eurostat, ECB, ILO, INEGI, Bank of Italy, ISTAT • The VTL Task Force works under the umbrella of the SDMX Technical Working Group

SDMX, which stands for Statistical Data and Metadata e. Xchange is an international initiative

SDMX, which stands for Statistical Data and Metadata e. Xchange is an international initiative that aims at standardising and modernising (“industrialising”) the mechanisms and processes for the exchange of statistical data and metadata among international organisations and their member countries. SDMX is sponsored by seven international organisations including the Bank for International Settlements (BIS), the European Central Bank (ECB), Eurostat (Statistical Office of the European Union), the International Monetary Fund (IMF), the Organisation for Economic Cooperation and Development (OECD), the United Nations Statistical Division (UNSD), and the World Bank.

VTL – purposes 1. provide an unambiguous language to communicate validation rules between different

VTL – purposes 1. provide an unambiguous language to communicate validation rules between different statistical organisations 2. provide a high-level language to document the data transformations 3. provide an efficient language for implementing data validation services 4. provide an efficient language for implementing data transformations

Versions of VTL • VTL 1. 0 published in March 2015 • Collection of

Versions of VTL • VTL 1. 0 published in March 2015 • Collection of comments (public review) • VTL 1. 1 published in November 2016 • Collection of comments (public review) • VTL 1. 1 final release will be published in July 2017 • SDMX web site: http: //sdmx. org/? page_id=5096

VTL – main principles Most of the VTL operators operate on datasets A dataset

VTL – main principles Most of the VTL operators operate on datasets A dataset is described by dimensions, measures and attributes Example: ds_bop_1 REF_AREA EU 25 BG RO EU 27 PARTNER CA CA TIME 2010 OBS_VALUE 20 1 1 23 OBS_STATUS D P P P Dimension Measure Attribute

VTL – main principles Example of a typical VTL operation: ds 3 : =

VTL – main principles Example of a typical VTL operation: ds 3 : = ds 1 + ds 2 Operations carried out by VTL: • join the data points of the ds 1 and ds 2 using the dimension values • apply the scalar function "+" to all pairs of numeric measures of ds 1 and ds 2 having the same name • if desired, execute an attribute propagation function defined by the user (e. g. concatenate the "flag" attribute of the two data points) • create a temporary dataset containing the resulting data points

Example of VTL validation rules Hierarchical validation rules Time-series rules Data point validation rules

Example of VTL validation rules Hierarchical validation rules Time-series rules Data point validation rules Boolean conditions check ( ds 1#obs_value >= 0 )

VTL - hierarchical ruleset Hierarchical ruleset: hr_euro_agg Example of VTL validation rules N. Antecedent

VTL - hierarchical ruleset Hierarchical ruleset: hr_euro_agg Example of VTL validation rules N. Antecedent variables: Rule variables: 1 time ref_area EU 15 = AT + BE + LU + DE + ES + FI + FR + EL + IE + IT + NL + PT + DK + UK + SE 2 EU 25 = EU 15 + CY + CZ + ES + HU + LT + LV + MT + PL + SK + SI 3 EU 27 = EU 25 + BG + RO 4 EU 28 = EU 27 + HR 5 time between 1995 and 2003 EU = EU 15 6 time between 2004 and 2005 EU = EU 25 7 time between 2006 and 2012 EU = EU 27 8 time >= 2013 EU = EU 28 9 EEA 15 = EU 15 + IS + NO + LI 10 EEA 25 = EU 25 + IS + NO + LI 11 EEA 27 = EU 27 + IS + NO + LI 12 EEA 30 = EU 27 + IS + NO + LI 13 time between 1995 and 2003 EEA = EEA 15 14 time between 2004 and 2005 EEA = EEA 25 15 time between 2006 and 2012 EEA = EEA 27 16 time >= 2013 EEA = EEA 30

VTL – datapoint validation ruleset create datapoint ruleset dr_flow_positive ( flow, obs_value ) when

VTL – datapoint validation ruleset create datapoint ruleset dr_flow_positive ( flow, obs_value ) when flow = "IMP" or flow = "EXP" then obs_value > 0 ; end horizontal ruleset The datapoint ruleset: • is defined on the variables flow and obs_value • verifies that in each data point of the dataset to be validated (not shown here) the component obs_value is greater than zero when the flow is "IMP" or "EXP". • the above syntax creates a ruleset (a permanent object) named "dr_flow_positive" 10 ESTAT

VTL – checking boolean conditions ds_result : = check ( ds 1 # obs_value

VTL – checking boolean conditions ds_result : = check ( ds 1 # obs_value > 1000, errorcode ( "Value must be greater than 1000" ) ) 11 ESTAT

Exercise 1 VTL code: ds_result : = check ( ds_bop # time_period between 2008

Exercise 1 VTL code: ds_result : = check ( ds_bop # time_period between 2008 and 2015, errorcode(“_____”), errorlevel(“Error”) ) ; ds_bop is the dataset containing the data to be validated Question: What is the correct text (error message) to be inserted in _____ ? 12 ESTAT

Exercise 2 ds_bop 1 REF_AREA EU 25 BG RO EU 27 PARTNER CA CA

Exercise 2 ds_bop 1 REF_AREA EU 25 BG RO EU 27 PARTNER CA CA TIME 2010 OBS_VALUE 20 1 1 23 OBS_STATUS D P P P VTL code: result : = check ( ds_bop 1 # obs_value, hr_euro_agg ) ; hr_euro_agg is the hierarchical ruleset described in slide 9. Question: What is that data point contained in the ds_result dataset? 13 ESTAT

Exercise 3 ds_bop 1 REF_AREA EU 25 BG RO EU 27 PARTNER CA CA

Exercise 3 ds_bop 1 REF_AREA EU 25 BG RO EU 27 PARTNER CA CA FLOW IMP IMP TIME 2010 OBS_VALUE 20 1 0 23 OBS_STATUS D P P P VTL code: ds_result : = check ( ds_bop 1, dr_flow_positive ) ; dr_flow_positive is the datapoint ruleset described in slide 10. Question: What is that data point contained in the ds_result dataset? 14 ESTAT

VTL – assessement of usability Assessment of usability by statisticians: • Covering several domains:

VTL – assessement of usability Assessment of usability by statisticians: • Covering several domains: Animal Production, Asylum, International Trade in Services, Energy, National Accounts, Short Term Statistics • Participation of 8 countries + Eurostat Some comments received: • Rules in plain english and examples of bad/good data are both essential • Rules in VTL may be useful as complement (to limit risks of ambiguity) • Need to agree on way to express the rule (negative or positive) 15 ESTAT

Development of VTL tools IT tools and services under development: • ECB • Norway

Development of VTL tools IT tools and services under development: • ECB • Norway VTL parser Java API based on JSON-stat format https: //github. com/statisticsnorway/java-vtl • Poland VTL to SQL translator UNECE paper • • Istat ESTAT VTL Editor Compiler (part of the Validation Service) Validation Rule Manager Sandbox: simple GUI + VTL translator to SQL ESTAT 16

Use of VTL: • ECB BIRD portal VTL is used to document the data

Use of VTL: • ECB BIRD portal VTL is used to document the data validations and transformations of the statistical process: http: //banks-integrated-reporting-dictionary. eu/bird-group • Continuous Capture of Metadata VTL is used as a common language to describe data transformations http: //c 2 metadata. org/ 17 ESTAT

Thank you for your attention! Any questions?

Thank you for your attention! Any questions?