Trusted Smart Statistics Centre Web Intelligence Hub
The case for a web intelligence shared system
A brief history of the use of Web data in European statistics • ESSnet Big Data I: OJA & enterprises websites • Exploration of the data source • ESSnet Big Data II: OJA & enterprises websites • Towards implementation • Prices • Wikipedia page views – culture and regional statistics
Moving big data to implementation • ESSnet Big Data I: pilots • Call for a refocus on the implementation of the most successful pilots • ESSnet Big Data II: new pilots + trusted smart statistics + implementation • Implementation = producing specifications for implementing, experimental statistics, recommendations for data / process governance • Online job advertisements • Enterprises websites • Smart electricity meters
Online job advertisements (OJA) • Advertisements published on the World Wide Web for job vacancies available in enterprises • Include data on • characteristics of the job (occupation, location, …) • characteristics of the employer (economic activity, …) • requirements (education and skills, …) • Partly available only as natural language data • this requires specific processing and analysis methodologies, but offers much richer information and avoids pre-conceived classifications (e.g. important for identifying emerging skills)
OJA data collection systems • Mostly national approach • ESSnet Big Data I & II • European approach • Cedefop • DG-CNECT • MOVIP - Monitor of Online Vacancies for ICT • Vacancies for ICT - Online RepositorY (VICTORY) • VICTORY 2
http://www.pocbigdata.eu/monitorICTonlinevacancies/general_info/
The case for a web intelligence shared system • Acquiring Web data in a statistical production context is not easy (e.g. data agreements) • Infrastructure with big data capabilities is required • Specialised skills are required • Spreading web intelligence capabilities across the ESS will take a very long time • A European system for OJA will exist anyway
Trusted Smart Statistics Centre (TSSC) Web Intelligence Hub (WIH)
Trusted Smart Statistics • Web Intelligence: Online job vacancies, Wikipedia, Enterprise websites, Internet platforms, … • Trusted Smart Surveys • Mobile Network Operator Data • Transport and Logistics • Smart Systems • Time use, Household budget, … • Human presence and movements, Tourism • Vessel traffic, Air traffic, Railway traffic, … • Earth observation, Smart farming, Smart devices, IoT for smart cities, Smart traffic, … • Community of experts • Provide the ESS with core capabilities to harness new data sources/channels and integrate them in production
Use Case 1: Skills • Statistics layer • Application layer • Services layer • Data layer • Skills statistics • Use Case 2: SM4MNE • Augmented job vacancies statistics • Online Job Advertisements • Augmented EGR • Enhanced prices statistics • Multinational enterprises business data • Online prices • Web Intelligence Hub • Web data
Web Intelligence Hub - Services • Provide support to ESS partners in • Data acquisition (web scraping, APIs) • Trans-national data agreements • Partnership models for national data agreements • IT infrastructure and tools • Analytical services (e.g. NLP) • Methodology • Regulatory aspects • Skills (training material) • R&D collaboration • Governance
Web Intelligence Hub - Principles • ESS hub • Serving national and European needs • Modular structure • Defined processes and products to be guaranteed • Priority to working together, possibility to act individually • Programs should be open source • As much transparency as possible • Commonly used processes should be certified and auditable • Lineage of data and processes • Intermediate products usable by all partners
Towards Common ESS WIH architecture
1st WIH use case: Online Job Advertisements • Cedefop system
Landscaping Data pipeline
The objects of interest
Extracting data from an OJA • Web scraping: extracting data from web pages
Web page (Home > Now Hiring > Data Scientist) • DATA SCIENTIST • Location: Luxembourg • Application deadline: Monday, 30 September 2019 • Reference number: 100 • APPLY NOW
Structured data • Title: Data scientist • Area: Luxembourg • Time: Monday, 30 September 2019 • Description: (free text)
Natural language data • Description: This role will directly support the Head of Data Labs in the design and implementation of advanced analytics as part of the Group-wide Data and Analytics strategy. We are seeking a high calibre creative thinker who can turn business problems into solutions by applying sophisticated and targeted analytics. The role will be project based, working as part of the wider Data Labs team to put analytics into practice across the organisation.
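The structured part of the example above can be recovered with very little code. The following is a minimal sketch, assuming the page has already been downloaded and stripped of its markup; the function name `extract_structured_fields`, the label-to-variable mapping and the English date format are illustrative assumptions and are not taken from the Cedefop system.

```python
import re
from datetime import datetime

# Text of the example advertisement after HTML tags have been removed.
AD_TEXT = """DATA SCIENTIST
Location: Luxembourg
Application deadline: Monday, 30 September 2019
Reference number: 100
Description
This role will directly support the Head of Data Labs."""

# Hypothetical mapping from labels on the page to output variables.
LABELS = {
    "Location": "area",
    "Application deadline": "time",
    "Reference number": "reference",
}


def extract_structured_fields(text):
    """Return a dict of the 'Label: value' pairs found in the ad text."""
    lines = text.splitlines()
    # Assumption: the first line of the ad is its title.
    record = {"title": lines[0].strip().capitalize()}
    for line in lines[1:]:
        match = re.match(r"^([^:]+):\s*(.+)$", line)
        if match and match.group(1) in LABELS:
            record[LABELS[match.group(1)]] = match.group(2).strip()
    # Normalise the deadline to an ISO date (assumes English day/month names).
    if "time" in record:
        parsed = datetime.strptime(record["time"], "%A, %d %B %Y")
        record["time"] = parsed.date().isoformat()
    return record


record = extract_structured_fields(AD_TEXT)
print(record)
```

Lines without a recognised label, such as the free-text description, are simply skipped here; they are the natural language part that needs separate processing.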
Extracting data from an OJA natural language
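One simple way to get variables out of the free-text description is dictionary matching. The sketch below is only an illustration under that assumption: the skill labels and trigger phrases in `SKILL_PHRASES` are hypothetical, and a production system would normally rely on much richer classifiers than plain substring lookup.

```python
import re

# Hypothetical mini-dictionary: skill label -> phrases that signal it.
SKILL_PHRASES = {
    "advanced analytics": ["advanced analytics", "targeted analytics"],
    "problem solving": ["business problems into solutions", "problem solving"],
    "teamwork": ["working as part of", "team player"],
}


def extract_skills(description):
    """Return the sorted skill labels whose phrases occur in the text."""
    # Lower-case and collapse whitespace so phrases match across line breaks.
    text = re.sub(r"\s+", " ", description.lower())
    found = set()
    for skill, phrases in SKILL_PHRASES.items():
        if any(phrase in text for phrase in phrases):
            found.add(skill)
    return sorted(found)


DESCRIPTION = (
    "We are seeking a high calibre creative thinker who can turn business "
    "problems into solutions by applying sophisticated and targeted analytics. "
    "The role will be project based, working as part of the wider Data Labs team."
)
skills = extract_skills(DESCRIPTION)
print(skills)
```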
Extracting data from an OJA (Cedefop)
Variable            From structured fields   From natural language (NL)   Overall precision rate (NL)
Occupation          4.74%                    100.00%                      85%
Education level     4.50%                    99.86%                       99%
Experience          2.68%                    45.64%
Contract            29.76%                   79.64%                       99%
Economic activity   39.92%                   100.00%                      92%
Working hours       24.14%                   64.66%                       99%
Place               88.42%                   97.86%                       n.a.
Salary              23.32%                   21.38%                       n.a.
Skill               0.81%                    90.43%                       n.a.
Release date        100.00%                  n.a.                         n.a.
Expire date         13.18%                   100.00%                      n.a.
Main skills required for Office professionals
Occupations for which the skill “Use Microsoft office” is most frequently required
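A ranking of this kind can be derived from classified advertisements by counting, per occupation, how often the skill appears. The sketch below assumes the ranking is based on the share of each occupation's ads that mention the skill (the slide does not say whether counts or shares are used); the records in `ADS` are made up for illustration.

```python
from collections import Counter

# Made-up records for illustration: (occupation, skills extracted from the ad).
ADS = [
    ("Office clerk", {"use Microsoft Office", "teamwork"}),
    ("Office clerk", {"use Microsoft Office"}),
    ("Secretary", {"use Microsoft Office", "communication"}),
    ("Secretary", {"communication"}),
    ("Data scientist", {"Python", "teamwork"}),
]


def occupations_by_skill_share(ads, skill):
    """Rank occupations by the share of their ads that require `skill`."""
    totals = Counter(occupation for occupation, _ in ads)
    hits = Counter(occupation for occupation, skills in ads if skill in skills)
    shares = {occ: hits[occ] / totals[occ] for occ in totals}
    # Highest share first; ties are broken alphabetically.
    return sorted(shares.items(), key=lambda item: (-item[1], item[0]))


ranking = occupations_by_skill_share(ADS, "use Microsoft Office")
for occupation, share in ranking:
    print(f"{occupation}: {share:.0%}")
```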
Enhanced job vacancies statistics Nowcasting More regional detail
Landscaping Stakeholders engagement (OJA example) • Eurostat coordinates; NSIs distributed; Cedefop participates; EC DGs NOT involved • Eurostat responsible; shared between NSIs, but NSIs can deploy nationally, can develop better classifiers and can submit new sources to be ingested; Cedefop follows; EC DGs NOT involved • Eurostat disseminates new statistics; NSIs disseminate national statistics; Cedefop disseminates its dashboards; EC DGs have their specific tools (designed with Eurostat support)
Roadmap
ESS Big Data Steering Group, 12 Feb. 2019 • highlighted the proposal's strategic potential for the ESS; • recalled the importance of addressing methodology, data quality, IT, architecture and governance; • recognised and stressed the widened scope of the project, which additionally includes information on skills; • pointed out the importance of a multi-purpose, modular approach and the need to integrate national, local and European needs; • noted that it would provide a crucial element for ESS capacity building and reputation.
ESS Big Data Steering Group, 12 Feb. 2019 • Eurostat undertook to elaborate the proposal further in collaboration with • ESSnet Big Data II work package B (online job vacancies) and • the ESS Task Force on Big Data for Official Statistics.
Milestones 1. Memorandum of understanding Eurostat / Cedefop (Dec 2019) 2. Design functional architecture of Web Intelligence Hub – WIH (March 2020) 3. Start building WIH components (2020) 4. Mirroring Cedefop system on EC infrastructure (2020) 5. Adaptation to web intelligence hub architecture (2021) 6. Joint running production system
WI Platform: state of play • Data Science Lab • V1.0 deployed Feb 2020 • OJA production playground • Environment set up: 30 April 2020 • Advanced coaching: May 2020 • Eurostat production environment • Environment set up: 30 September 2020 • Parallel running: October – December 2020 • Solo running: 2021 • Web Intelligence Platform • Design: 2020 • Porting OJA to WIP: 2021
Study on inferring job vacancies from online job advertisements
OBJECTIVES OF THE CONTRACT • Develop estimator(s) for the number of job vacancies from data on online job advertisements, to be used to enhance JVS.
TASKS
1. Development of an inference method (Jan 2020)
2. Empirical test of the inference methods • Empirical test for the preliminary list of Member States (Mar 2020) • Empirical test of the finally selected inference method for all countries for which data are available from Cedefop, with an explanation of the steps performed (Apr 2020) • Properly documented R source code of the empirical test (May 2020)
3. Presentation to stakeholders (May 2020)
4. Scientific summary of the project findings from all the above deliverables, to be included in the Eurostat Statistical Working Paper collection series (Jun 2020)
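The slides do not say which estimator the study develops. Purely to illustrate what "inferring job vacancies from OJA" can mean, the sketch below assumes a simple ratio approach: the vacancies-per-advertisement ratio from a benchmark period with survey figures is applied to a new OJA count. The function `ratio_estimate` and all numbers are hypothetical and do not come from the study.

```python
def ratio_estimate(oja_count, benchmark_vacancies, benchmark_oja_count):
    """Scale an OJA count by the vacancies-per-advertisement ratio
    observed in a benchmark period where survey figures exist."""
    if benchmark_oja_count <= 0:
        raise ValueError("benchmark_oja_count must be positive")
    ratio = benchmark_vacancies / benchmark_oja_count
    return oja_count * ratio


# Made-up numbers: 50,000 vacancies and 40,000 ads in the benchmark quarter,
# 44,000 ads observed in the quarter to be estimated.
estimate = ratio_estimate(44_000, 50_000, 40_000)
print(estimate)
```

A ratio of this kind assumes that the relationship between advertisements and vacancies is stable over time, which is exactly the sort of assumption an empirical test across countries would need to check.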
- Slides: 31