ESSnet on Datawarehousing the business register Pieter Vlag
ESSnet on Datawarehousing - the business register Pieter Vlag – Statistics Netherlands
Outline of the presentation • Data. Ware. House and importance population frame • relationship population frame - business register - • (default) target population, statistical units other crucial datasources: “backbones” - turnover + employment • datalinking : the statistical unit base • conflicting information between datasources - when correcting in statistical DWH - when correcting in backbones - when feedback to business register ESSnet DWH – business register 1
Definition of a statistical Datawarehouse (according to the FPA) The broad definition of a data warehouse to be used in this ESSnet is therefore: ‘A common conceptual model for managing all available data of interest, enabling the NSI to (re)use this data to create new data/new outputs, to produce the necessary information and perform reporting and analysis, regardless of the data’s source. ’ ESSnet DWH – business register 2
A Data. Warehouse: the general idea As staging area is “core business” for NSIs, DWH – business register term ESSnet statistical DWH is used for staging area + Ware. House 3
The statistical Data. Warehouse: architecture and layers ESSnet DWH – business register 4
The statistical Data. Ware. House: processing steps the GSBPM model 5. 1 a: link data 5. 1 b: integrate data see presentation Fursova input process DWH / int. data Calculate aggregates ESSnet DWH – business register 5
GSBPM -step Output 1 Output 3 Output 2 7 6 Integrated data 5. 7 p. analyse 5. 2 5. 6 Processing (integration layer) 5. 1 Linking Titel van de 1 presentatie datasource 2 datasource 3 4 6
A datawarehouse without population frame Datasource I: Survey 1 Datasource I: Admin data Datasource I: BIG DATA Datasource I: Survey 2 • • different sources cover different enterprises -> information about ? ESSnet DWH – business register timing of availability sources differs -> when complete desc. available ? 7
ADVANTAGE: Datasource 4: survey 2 Datasource 3: survey 1 Datasource 2: BIG DATA Datasource 1: admin data 1 Population. A Datawarehouse with a population frame 8 DWHof – business the. ESSnet coverage DWH isregister known (e. g. which enterprises are included in a DWH)
Units and target population The population should be known for the • datawarehouse; e. g. “about which enterprises info” • its preparation phase ; e. g. when linking data sources Challenges are: • units may differ between the data sources - decision: which unit used for linking • what is the reference population - decision: how is the default target population defined ESSnet DWH – business register 9
Proposals • • - Only statistical unit (=enterprise) is used for data-linking - in processing phase of the statistical - DWH - justification: most obvious, ESSnet on Consistency, maintenance Default target population : all enterprises with economic activity in reference period (e. g. year) - justification: SBS-regulation widest definition of enterprises from which flexible outputs for subpopulations can be derived term default is used: as subpopulations do have a target population, too ESSnet DWH – business register 10
GSBPM -step 7 flexible output for different populations, and units Integrated data Weighting to flexible pop. 6 5. 7 5. 2 Processing on stat. unit + default target population only 5. 6 Linked data flexible datasources with different populations and units Titel van de presentatie 5. 1 4 11
Population frame and the Business Register Determination of the default target population in SDWH in 2 steps: • the population frame, i. e. a list of enterprises with a certain kind of activity during a period. • confirmation which enterprises of the list really performed economic activities during a period The business register provides information for the population frame. Therefore, the statistical Business Register is an indirect datasource for the statistical-DWH ESSnet DWH – business register 12
Information needed from stat. business register Recommended information for the population frame : • • • the frame reference year the statistical enterprises unit, including national ID and EGR ID the name and address of the enterprise the national identification number (ID) of the enterprise the date in population (mm/yr) the date out of population (mm/yr) the NACE-code the institutional sector code a size class ESSnet DWH – business register 13
Other backbones ESSnet Admin. Data: VAT and social security admin • • almost complete for quarter and annual can be used for high-quality estimates for turnover + employment respectively. ESSnet DWH: VAT and social security data are crucial • • to confirm the activity status of enterprises implictly to determine the default target population to integrate data suitable for flexible outputs measurement errors are reduced of sample survey (or data about subpopulation) if weighting to pop. numbers + VAT-turnover + employment Proposal: to include these admin data as backbones in a stat-DWH ESSnet DWH – business register 14
Backbones in a statistical-DWH Access layer GSBPM 7 -9: disseminate Integration layer GSBPM 5. 7 -5. 8: calculate aggregates GSBPM 5. 2 -5. 6: “process” Int. + Analyses layer GSBPM 6: analyse / “DATAWAREHOUSE” Check processing GSBPM 5. 1: link & integrate data 1 Pop-frame data 2 VAT empl. Source layer Backbones are crucial for data-linking and data-integration; -> need to be checked/cleaned by source in the source layer SBR 15
Observed: admin data incorporated in BR GSBPM 7 -9: disseminate Integration layer Int. + Analyses layer When choosing this option, GSBPM 6: analyse GSBPMpart 5. 7 -5. 8: calculateprocess aggregates - important of linking outside the S-DWH - unless S-DWH integral part of S-DWH (maintenance ? ) GSBPM 5. 2 -5. 6: “process” Check processing GSBPM 5. 1: link & integrate data 1 data 2 Pop-frame Source layer VAT SBR empl. 16
Determining default target population If statistical-DWH covers annual statistics only • relatively straightforward - derive population frame from business register at the end of reference year t - determine active or non-active as soon as VAT and/or employment data become available If STS included in statistical-DWH more complicated: - updating necessary ! ESSnet DWH – business register 17
Updating population ESSnet DWH – business register 18
The largest enterprises output 2 output 1 GSBPM 5. 7 -5. 8: calculate aggregates output 3 “DATAWAREHOUSE” If a team within a NSI produces consistent microdata for largest enterprises GSBPM 5. 2 -5. 6: as “process” -> consider this source backbone Check processing GSBPM 5. 1: link & integrate VAT Pop-frame data 1 SBR empl. L. E. data 2 19
Units: ideal situation • enterprise has a unique ID • enterprise group has a unique ID enterprise and enterprise group • correspond with statistical definitions • are used in all data sources In practice more complex situations do exist (especially when using more admin data) ESSnet DWH – business register 20
Key question: how to manage these different in- and output units and their relationships to the statistical unit Flexible output for different populations, and units 7 6 5. 7 Integrated data processing on one unit + one population only Linked data flexible datasources with different population and units Titel van de presentatie GSBPM -step 5. 2 5. 6 5. 1 4 21
INPUT IN S-DWH OUTPUT processing Legal unit KAU “Accountìng” unit LKAU “VATunit” Local unit ENTERPRISE (=statistical unit) ENTERPRISE GROUP “other tax” units enterprise Enterprise group other units ESSnet DWH – business register 22
The unit base Some remarks: • Complexity of unit base depends on - scope of statistical-DWH - national legislation (practices) with respect to enterprise units • Unit base closely related to Business Register. Main motivation to place this base outside the Business registers - more flexible in case of new in- and outputs - more transparent in case of linking errors ESSnet DWH – business register 23
Position of Business Register in stat -DWH output 2 output 1 GSBPM 5. 7 -5. 8: calculate aggregates output 3 “DATAWAREHOUSE” GSBPM 5. 2 -5. 6: “process” Check processing GSBPM 5. 1: link & integrate Pop-frame survey tax L. E. other units SBR BIG DATA VAT empl. 24
Feedback to Business Register In case of conflicting information between datasources and conclusion is influential error in backbones (and indirectly SBR) • When incorporating corrections in statistical DWH ? • When incorporating corrections in backbones ? • When incorporating corrections in SBR? ESSnet DWH – business register 25
Correction of information output 1 output 2 output 3 GSBPM 5. 7 -5. 8: calculate aggregates “DATAWAREHOUSE” GSBPM 5. 1 -5. 6: “process” Check processing In SDWH: corrections at 5. 6 GSBPM 5. 1: link & integrate survey units other Pop-framethemselves: In backbones VAT timing most empl. important L. E. revisions 26 SBR end of year (for consistency) – exception major impact In SBR: after
Conclusions Requirements for statistical-DWH • Population well defined • Use of one unit in processing Backbones desired for • populations, VAT-turnover, admin data employment, large enterprises Business Register is indirect input for statistical DWH • population frame, unit base, survey Timing of corrections errors (backbone information) • in DWH: before weighting • in backbone: when revising • in Business Register: end of year ESSnet DWH – business register 27
- Slides: 28