Dealing with Big Data for Official Statistics: IT Issues
Giulio Barcaroli, Stefano De Francisci, Monica Scannapieco, Donato Summa
Istat – Italian National Institute of Statistics
Outline
1. Background: Istat Big Data strategy and experimental projects
2. IT issues in experimental projects
3. Final remarks
Monica Scannapieco, Dealing with Big Data for Official Statistics: IT Issues, MSIS 2014
Istat Big Data Strategy - 1
§ Istat (the Italian National Institute of Statistics) set up a technical Commission to guide investments in Big Data adoption in statistical production processes
§ Duration: from February 2013 to February 2015
§ Members come from different areas: Official Statistics, Academia, Private Sector
Objective of the talk
§ I will NOT deal with (just) technological issues
§ I will deal instead (mainly) with IT methodological issues
§ Example:
  § MapReduce/Hadoop: open source framework
  § Map-Reduce programming model: are Official Statistics (OS) methods mapreduce-able?
  § Current research investigates the mapreduce-ability of (classes of) computational problems
Istat Big Data Strategy - 2
§ The Commission will release a strategy for Big Data adoption
§ Three experimental projects launched and monitored by the Commission:
  § Persons and Places
  § Labour Market Estimation based on Google Trends
  § ICT Usage in enterprises based on Internet as a Data Source (IaD)
§ Status: advanced implementation (first results already available)
Persons and Places
§ Purpose
  § Production of the origin/destination matrix of daily mobility for work and study purposes, at the spatial granularity of municipalities, starting from phone (tracking) data
§ Actors involved in the project
  § Istat
  § National Research Council
  § University of Pisa
§ Methodology
  § Inference of population mobility profiles from GSM Call Data Records (CDRs)
  § Comparison with data derived from administrative sources
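To make the notion of an origin/destination matrix concrete, here is a minimal sketch, not the project's actual method. It assumes a hypothetical CDR layout of (user, hour of day, municipality) and a naive rule: the origin is the most frequent municipality in night hours, the destination the most frequent one in working hours. The hour windows, field names and sample records are all illustrative assumptions.

```python
from collections import Counter, defaultdict

# Hypothetical, already-geocoded CDRs: (user_id, hour_of_day, municipality).
cdrs = [
    ("u1", 2, "Pisa"), ("u1", 23, "Pisa"), ("u1", 10, "Livorno"), ("u1", 15, "Livorno"),
    ("u2", 1, "Lucca"), ("u2", 22, "Lucca"), ("u2", 11, "Pisa"), ("u2", 14, "Pisa"),
    ("u3", 3, "Pisa"), ("u3", 9, "Pisa"), ("u3", 16, "Pisa"),
]

def most_frequent(places):
    """Return the most common municipality, or None if there are no records."""
    return Counter(places).most_common(1)[0][0] if places else None

def od_matrix(records):
    """Count users per (origin, destination) pair of municipalities."""
    night, day = defaultdict(list), defaultdict(list)
    for user, hour, place in records:
        if hour < 7 or hour >= 21:      # assumed "at home" window
            night[user].append(place)
        elif 9 <= hour < 18:            # assumed "at work/study" window
            day[user].append(place)
    matrix = Counter()
    for user in night.keys() & day.keys():
        origin, dest = most_frequent(night[user]), most_frequent(day[user])
        matrix[(origin, dest)] += 1
    return matrix

matrix = od_matrix(cdrs)
for (origin, dest), count in sorted(matrix.items()):
    print(f"{origin} -> {dest}: {count}")
```

Users with no records in one of the two windows are simply dropped here; a real pipeline would need an explicit rule for them.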
Labour Market Estimation
§ Purpose
  § Test the usage of Google Trends for forecasting and nowcasting purposes in the Labour Force domain
§ Actors involved in the project
  § Istat: Central Methodology Sector and Labour Force Survey
§ Methodology
  § Autoregressive models vs. models using Google Trends as a predictor
  § Comparison extended to macroeconomic prediction models
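The comparison in the Methodology bullet can be illustrated with a small sketch. It fits an AR(1) model and an AR(1) model extended with a contemporaneous search index, then compares hold-out error. Everything here is an assumption for illustration: the series are synthetic (not Istat or Google Trends data), the model orders are arbitrary, and the tiny `ols` helper is a stand-in for a proper statistics library.

```python
import random

def ols(X, y):
    """Least-squares coefficients via the normal equations (Gauss-Jordan)."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(X, y))] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

def mae(coef, X, y):
    """Mean absolute error of the linear predictions X @ coef against y."""
    preds = [sum(b * x for b, x in zip(coef, row)) for row in X]
    return sum(abs(p - t) for p, t in zip(preds, y)) / len(y)

# Synthetic data: g plays the role of a search-volume index, y of the target series.
rng = random.Random(0)
n = 60
g = [rng.uniform(0, 4) for _ in range(n)]
y = [10.0]
for t in range(1, n):
    y.append(1 + 0.5 * y[t - 1] + 2 * g[t] + rng.gauss(0, 0.1))

target = y[1:]
X_ar = [[1.0, y[t - 1]] for t in range(1, n)]          # intercept + lagged value
X_gt = [[1.0, y[t - 1], g[t]] for t in range(1, n)]    # ... + search index
split = 40  # first 40 rows for fitting, the rest held out

mae_ar = mae(ols(X_ar[:split], target[:split]), X_ar[split:], target[split:])
mae_gt = mae(ols(X_gt[:split], target[:split]), X_gt[split:], target[split:])
print(f"hold-out MAE, AR(1): {mae_ar:.3f}; AR(1) + search index: {mae_gt:.3f}")
```

Because the synthetic target is built from the index, the extended model wins by construction; with real data, whether the search index helps is exactly what the experiment has to establish.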
ICT Usage in Enterprises
§ Purpose
  § Evaluate the possibility of adopting Web scraping and text mining techniques for estimates on the usage of ICT by enterprises and public institutions
§ Actors involved in the project
  § Istat: Survey on the ICT Usage in Enterprises
  § Cineca (Consortium of Italian universities, National Research Council and Ministry of Education and Research)
§ Methodology
  § Scraping of web sites for data extraction
  § Supervised classification task
Features of Experimental Projects
1: Persons & Places
  • Data source: mobility data
  • Scenario (impact on the production process): deep impact, source replaces traditional sampling and collection
  • Key technologies: MapReduce/Hadoop (future)
2: Google Trends
  • Data source: Web search records
  • Scenario (impact on the production process): considerable impact, estimation phase
  • Key technologies: Google Trends libraries
3: ICT Usage
  • Data source: Web data extraction
  • Scenario (impact on the production process): limited impact, subset of data gathered by using IaD
  • Key technologies: scraping, NoSQL, machine learning libraries, MapReduce/Hadoop (future?)
Statistical Phases for Big Data Management
[Figure: principal selected phases, annotated "Inversion of the two phases" and "Collapsed phases"]
• Principal selected phases
• Inversion due to the fact that the “traditional” design phase is no longer present for Big Data
• Collapse due to the fact that the same methods can be used for both phases
• Other phases, e.g. Dissemination, not (yet) involved in Big Data
Collect: IT issues - 1
§ Access to Big Data sources:
  § Type 1: Access control mechanisms deliberately set up by the Big Data provider, and/or
  § Type 2: Technological barriers
§ Google Trends:
  § Absence of APIs prevents programmatic access to Google Trends (GT) data
  § Its usage in production processes therefore cannot be foreseen
Collect: IT issues - 2
§ ICT Usage: both type 1 and type 2 problems
  § 8,647 URLs of enterprises’ Web sites, but only about 5,600 were actually accessed
  § Type 1: Scrapers deliberately blocked, e.g. by mechanisms that verify human access to sites, like CAPTCHA
  § Type 2: Some technologies, like Adobe Flash, simply prevented access to content
Design: IT Issues - 1
§ Even if a traditional survey design cannot take place, the problem of “understanding” the data is still present
§ Semantic extraction techniques
  § Knowledge representation and natural language processing
  § E.g.: FRED (http://wit.istc.cnr.it/stlabtools/fred) extracts an ontology from sentences in natural language
Design: IT Issues - 2
§ ICT Usage:
  § Human inspection refined by:
    § Some NLP techniques: tokenization, stemming, stopword removal, etc.
    § Semantic enrichment by semantic dictionaries (WordNet)
    § Images: tag extraction
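The NLP steps named above can be sketched as a small text-preparation pipeline. This is an illustrative sketch only: the stopword list is a tiny hand-made English sample, and the suffix stripper is a crude stand-in for a real stemmer (e.g. a Porter or Snowball stemmer for the language of the scraped sites), not the tooling used in the project.

```python
import re

# Tiny illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "on", "for",
             "is", "are", "our", "we"}
# Suffixes tried in order; a deliberately naive replacement for a real stemmer.
SUFFIXES = ("ing", "ed", "es", "s")

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenize, drop stopwords, then stem what is left."""
    return [stem(t) for t in tokenize(text) if t not in STOPWORDS]

terms = preprocess("We are selling products online and accepting orders")
print(terms)  # ['sell', 'product', 'online', 'accept', 'order']
```

The resulting term lists are the kind of input a supervised classifier could use as features.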
Process/Analyse: IT Issues - 1
§ Big size, possibly solvable by Map-Reduce algorithms
§ Model absence, possibly solvable by learning techniques
§ Privacy constraints, solvable by privacy-preserving techniques
Process/Analyse: IT Issues - 2
§ Map-Reduce algorithms: problem of formulating algorithms that can be implemented according to such a paradigm -> “mapreduce-ability”
§ Recent state-of-the-art Map-Reduce algorithms for:
  § Basic graph problems, e.g. minimum spanning trees, triangle counting and matching
  § Combinatorial optimization, e.g. maximum coverage, densest subgraph, and k-means clustering
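A simple case of a mapreduce-able statistical method: mean and variance can be computed from per-partition partial sums that are merged associatively. The sketch below imitates the paradigm in plain Python (no Hadoop involved); the function names and the sample partitions are illustrative assumptions.

```python
from functools import reduce

def map_partition(values):
    """Map step: summarize one partition as (count, sum, sum of squares)."""
    return (len(values), sum(values), sum(v * v for v in values))

def combine(a, b):
    """Reduce step: partial summaries merge by component-wise addition."""
    return tuple(x + y for x, y in zip(a, b))

def mean_and_variance(partitions):
    """Population mean and variance from independently summarized partitions."""
    n, s, ss = reduce(combine, map(map_partition, partitions), (0, 0.0, 0.0))
    mean = s / n
    return mean, ss / n - mean ** 2

# Data split across three hypothetical worker nodes.
partitions = [[1, 2, 3], [4, 5], [6]]
mean, variance = mean_and_variance(partitions)
print(mean, variance)
```

Because `combine` is associative and commutative, partitions can be processed in any order and on any number of nodes. The sum-of-squares formula is kept for brevity and can lose precision on large values. Statistics such as the exact median do not decompose this way, which is where the mapreduce-ability question becomes non-trivial.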
Process/Analyse: IT Issues - 3
§ Persons and Places:
  § Match mobility-related data with data stored in Istat archives
  § Record linkage problem to be solved (future task)
§ Model absence: neither survey-based nor “traditional” model-based approaches are directly applicable to Big Data
  § Possible usage of machine learning techniques
Process/Analyse: IT Issues - 4
§ ICT Usage:
  § Supervised learning methods (including classification trees, random forests and adaptive boosting) used to learn the YES/NO answers to the questionnaire
§ Persons and Places:
  § Unsupervised learning technique, namely SOM (Self-Organizing Map), to learn mobility profiles
  § E.g. “free city users” vs. “embedded city users” (more confidently estimated by deterministic constraints)
Process/Analyse: IT Issues - 5
§ Privacy constraints could potentially be solved in a proactive way by relying on techniques that work on anonymous data
  § Privacy-preserving data integration, e.g. [DMKM-2004]
  § Privacy-preserving data mining, e.g. [TKDE 2004]
§ Persons and Places: anonymous matching of CDRs with Istat archives via privacy-preserving record linkage
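One simple form of privacy-preserving record linkage is exact matching on keyed hashes: each data holder replaces the identifier with an HMAC computed under a shared secret key, and the linkage step sees only the resulting pseudonyms. The sketch below illustrates that idea only; it is not the technique of the cited works, it does not tolerate typos in identifiers, and the key, identifiers and payloads are all hypothetical.

```python
import hashlib
import hmac

def pseudonym(identifier: str, key: bytes) -> str:
    """Keyed hash of a normalized identifier (HMAC-SHA256, hex-encoded)."""
    normalized = identifier.strip().upper()
    return hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()

def encode(records, key):
    """Run by each data holder: swap the identifier for its pseudonym."""
    return {pseudonym(ident, key): payload for ident, payload in records}

def link(encoded_a, encoded_b):
    """Run by the linkage unit: join on pseudonyms, never on raw identifiers."""
    shared = sorted(encoded_a.keys() & encoded_b.keys())
    return [(encoded_a[p], encoded_b[p]) for p in shared]

# Hypothetical shared secret and toy records: (identifier, payload).
key = b"secret-agreed-by-the-data-holders"
mobility = [("id-001 ", "commuter profile"), ("id-002", "resident profile")]
archive = [("ID-001", "municipality: Pisa"), ("id-003", "municipality: Lucca")]

matches = link(encode(mobility, key), encode(archive, key))
print(matches)  # [('commuter profile', 'municipality: Pisa')]
```

A keyed hash is used rather than a plain hash so that a party without the key cannot rebuild pseudonyms by hashing guessed identifiers.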
Concluding Remarks
§ Illustration of some IT issues considered relevant for Big Data adoption by OS, on the basis of practical experiences
§ Technology is probably not the issue, but IT methodology is!
§ Some IT issues also share some statistical methodological aspects
§ Other relevant IT issues:
  § Event data management
  § On-line analytics
Thank you for your attention!