Big Data ESSNet: Web Scraping for Job Vacancy Statistics
Nigel Swier, UK Office for National Statistics
Potential of On-line Job Vacancy Data
Current Official Estimates (Survey) compared with Online data
Frequency: Monthly (Rolling Qtr) compared with Real-time?
Breakdowns shown: National Totals, Industry Sector, Enterprise Size, Job type / skills, Geography
Potential gains from online data: More frequent, More timely, More granular, Less burden, Cheaper?
Six challenges with using On-line Job Vacancy (OJV) data for statistical purposes
1. Not all jobs are advertised on-line. Coverage is therefore incomplete and not fully representative.
2. There is no definitive data source.
3. Much OJV data is unstructured. Text processing and analysis are required to extract useful information.
4. Some job advertisements are not within the scope of the official statistics definition of a job vacancy (EU).
5. The official definition of a job vacancy does not correspond directly to the concept of a live job advertisement.
6. The specific job vacancy data landscape varies between countries.
On-line Job Vacancy Data Landscape
[Diagram] Actors shown: Employers, Job Boards, Private Employment Agencies, Job Search Engines, National Employment Agency, Enterprise Websites, Data Aggregators, Public Policy, Official Job Vacancy Statistics, Cedefop
Approaches to Data Access
• Direct web scraping
    • Point and click
    • Programmatic (e.g. Python Scrapy)
• Web-scraping enterprise websites
• Agreed access
    • National employment agency
    • Private job portals
    • Commercial providers
    • CEDEFOP
Images: Creative Commons
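The programmatic route listed above can be sketched in a few lines. The slide names Scrapy; this sketch swaps in Python's standard-library html.parser so that it runs without extra installs or network access. The page markup, the class names ("job", "title", "location") and the helper parse_job_ads are all hypothetical; real job boards each use their own structure and terms of use.

```python
from html.parser import HTMLParser

# Hypothetical listing markup, inlined so the example needs no network access.
SAMPLE_PAGE = """
<div class="job"><h2 class="title">Data Analyst</h2><span class="location">Newport</span></div>
<div class="job"><h2 class="title">Staff Nurse</h2><span class="location">Cardiff</span></div>
"""


class JobAdParser(HTMLParser):
    """Collects the text of elements whose class is 'title' or 'location'."""

    FIELDS = {"title", "location"}

    def __init__(self):
        super().__init__()
        self.ads = []        # one dict per job ad found
        self._field = None   # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if css_class == "job":
            self.ads.append({})          # a new ad starts here
        elif css_class in self.FIELDS and self.ads:
            self._field = css_class      # next text node belongs to this field

    def handle_data(self, data):
        if self._field and data.strip():
            self.ads[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        self._field = None               # stop capturing at any closing tag


def parse_job_ads(html):
    """Returns a list of {'title': ..., 'location': ...} dicts (assumed fields)."""
    parser = JobAdParser()
    parser.feed(html)
    return parser.ads


ads = parse_job_ads(SAMPLE_PAGE)
for ad in ads:
    print(ad["title"], "|", ad["location"])
```

A Scrapy spider would express the same extraction with selectors and add crawling, throttling and export on top; the parsing step is the part illustrated here.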
Data Handling: Text analysis and classification
[e.g. classifying textual data with machine learning]
Can industry and occupation be classified from a job ad?
• Occupation is fairly straightforward in this case.
• Industry is more difficult: this company is an employment agency, not the employer. But there are clues...
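As one illustration of classifying job ad text with machine learning, the sketch below trains a small multinomial naive Bayes model with add-one smoothing on job titles. It is not the method used in the project, which the slides do not specify. The training texts and the labels "health" and "ict" are invented for the example and are not official occupation or industry codes.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-cases and keeps purely alphabetic tokens."""
    return [w for w in text.lower().split() if w.isalpha()]


def train(examples):
    """examples: list of (text, label). Returns the counts needed by predict()."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for w in tokenize(text):
            word_counts[label][w] += 1
            vocab.add(w)
    return label_counts, word_counts, vocab


def predict(model, text):
    """Returns the label with the highest smoothed log-probability."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)                      # class prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in tokenize(text):
            if w in vocab:                               # ignore unseen words
                score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented training examples; labels are placeholders, not official codes.
TRAINING = [
    ("registered nurse for hospital ward", "health"),
    ("care assistant in residential care home", "health"),
    ("software developer python backend", "ict"),
    ("data engineer cloud platform", "ict"),
]

model = train(TRAINING)
print(predict(model, "nurse wanted for care home"))
print(predict(model, "python developer"))
```

A model like this sees only the words in the ad, which is why an ad placed by an employment agency tends to carry stronger signals for occupation than for the industry of the eventual employer.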
Data Handling: Flow to stock transformation
[Figure: Job Vacancy Lifecycle]
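A flow to stock transformation can be sketched as follows, assuming each scraped ad carries the date it was first seen and the date it was last seen, and that an ad counts as live on every day between the two, inclusive. Both the record layout and that liveness rule are assumptions made for illustration; the slides do not state how the lifecycle is modelled.

```python
from datetime import date

# Hypothetical records: (ad_id, first_seen, last_seen).
ADS = [
    ("a1", date(2017, 1, 3), date(2017, 1, 20)),
    ("a2", date(2017, 1, 10), date(2017, 2, 15)),
    ("a3", date(2017, 2, 1), date(2017, 2, 5)),
]


def live_stock(ads, reference_date):
    """Counts ads live on reference_date (first and last seen dates inclusive)."""
    return sum(1 for _, first, last in ads if first <= reference_date <= last)


# Stock of live ads on a few reference dates.
REFERENCE_DATES = [date(2017, 1, 15), date(2017, 2, 3), date(2017, 3, 1)]
stock_by_date = {d: live_stock(ADS, d) for d in REFERENCE_DATES}

for d, stock in stock_by_date.items():
    print(d.isoformat(), stock)
```

Counting new postings per period gives the flow; counting ads live on a reference date gives a stock that is closer in concept to a survey estimate of vacancies open on a given day.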
Assessment against survey aggregates: by industry sector
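One simple way to carry out such an assessment is to divide the OJV stock in each industry sector by the corresponding survey estimate. The function name coverage_ratios and all the numbers below are made up for illustration; they are not results from the project or from any survey.

```python
def coverage_ratios(ojv_counts, survey_estimates):
    """Returns OJV stock divided by the survey estimate, per sector in both inputs."""
    return {
        sector: ojv_counts[sector] / survey_estimates[sector]
        for sector in survey_estimates
        if sector in ojv_counts and survey_estimates[sector] > 0
    }


# Made-up numbers for illustration only.
ojv = {"Health": 50, "Construction": 10, "ICT": 90}
survey = {"Health": 100, "Construction": 40, "ICT": 60}

ratios = coverage_ratios(ojv, survey)
for sector, ratio in sorted(ratios.items()):
    print(f"{sector}: {ratio:.2f}")
```

A ratio well below 1 suggests a sector that is under-represented on-line, while a ratio above 1 can point to duplicate ads, agency postings, or ads outside the official vacancy definition.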
Conclusions/Questions
• Agreed access arrangements are generally better than direct web scraping.
• OJV data cannot replace the Job Vacancy Survey (in EU).
• OJV data does not correspond to target concepts and only measures part of the labour market. How useful are these measures?
• If useful, how should these measures be presented alongside the official estimates?
• (EU) Collaboration with CEDEFOP is essential. How do we get the best possible quality data for official statistics purposes?
Other possibilities?
• Time series analysis, leading to flash estimates
• Data-driven analysis: new insights into existing statistics, e.g. timing of advertisements being placed
• International/overseas jobs as indicators of labour market “tightness”
• Identification of new job titles, assisting development of standard statistical classifications
Drivers of Cedefop RLMI work
• Better labour market information for better policies
• Lack of comparable data and systematic analysis
• Complement skills intelligence toolkit