Structured POI Data Extraction from Internet News

Dr. Kevin Zhang, Associate Professor, Beijing Institute of Technology
Email: kevinzhang@bit.edu.cn
http://hi.baidu.com/drkevinzhang/

Outline
• Motivation
• Challenges
• POI Extraction Modeling
• Experiments and Comparison

Motivation
• A point of interest, or POI, is a specific point location that someone may find useful or interesting.

Motivation II
• POI data includes:
  - a name or description
  - the related products or services
  - a telephone number
• POI data is very useful for helping an ordinary user locate a destination.
• POI data must be kept up to date: the losses caused by outdated POI data should not be underestimated.

However
• POIs change rapidly (new buildings, renaming, road modifications, traffic restrictions), especially while a city is developing.
  - The map of Wuhan changes every quarter.
  - Beijing has to revise its map every month.

Motivation III
• GIS data suppliers have to keep hundreds of vehicles on the road from morning till night, recording every change at each location. Without any hint or schedule, such aimless circling is expensive and time-consuming.

Motivation IV
• The organizations involved tend to announce POI changes promptly in news articles or on BBS pages.
• This work presents a POI extraction model that automatically generates POI data from news articles.

Challenges
• Chinese named entities in POI data are hard to recognize because of the word segmentation problem.
  “位于城关区皋兰路38号的兰州市第二家‘竞彩’加盟店正式开门纳客。”
  (Lanzhou's second lottery franchise store formally opened at No. 38 Gaolan Road, Chengguan District.)
  - Specific location name recognition: 兰州市 城关区 皋 兰路 38号
  - Specific organization name recognition: 兰州市 第二家 ‘竞 彩’ 加盟店

Challenges
• POI extraction has to tackle some tough natural language processing (NLP) tasks, such as disambiguation, temporal inference and coreference resolution. For instance: “成都银行重庆分行本月底在渝开业” (The Chongqing Branch of Bank of Chengdu opens in Yu at the end of this month).
  - Here, both “重庆” (Chongqing) and “渝” (Yu, the abbreviation of Chongqing) refer to the same POI location.
  - “本月底” (at the end of this month), combined with the publication time (Mar. 2010), resolves to the date 2010-3-31.
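
The temporal inference and abbreviation handling in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method used in the talk: the function names, the alias table and the publication day (the slide gives only the month, Mar. 2010) are hypothetical.

```python
import calendar
from datetime import date

# Hypothetical alias table mapping abbreviations to canonical place names.
ALIASES = {"渝": "重庆"}


def resolve_end_of_month(publish_date: date) -> date:
    """Resolve "本月底" (end of this month) relative to the publication date."""
    # monthrange returns (weekday of the 1st, number of days in the month).
    last_day = calendar.monthrange(publish_date.year, publish_date.month)[1]
    return publish_date.replace(day=last_day)


def canonical_location(name: str) -> str:
    """Map an abbreviation such as "渝" to its full name; leave other names unchanged."""
    return ALIASES.get(name, name)


# Assumed publication day within Mar. 2010.
published = date(2010, 3, 15)
print(resolve_end_of_month(published))   # 2010-03-31
print(canonical_location("渝") == canonical_location("重庆"))  # True
```

A real system would need a much larger set of relative time expressions and a gazetteer of abbreviations; the sketch covers only the single case shown on the slide.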

Challenges
• Last but not least, more attention should be paid to the relationships between different entities.
  “杭州蚂蚁搬家公司专业写字楼, 学院搬迁, 居民搬家 - 滨江搬家 - 杭州58同城”
  (Hangzhou Mayi Moving Company, specializing in office, college and residential moves - Binjiang moving - Hangzhou 58.com)
  - 搬家公司: a moving company
  - 公司搬家: a company moving its office

Architecture

Text Preprocessing
• Erase noisy news documents using POI linguistic features.
• A location lexicon with 640,000 entries was imported.
• Perform lexical analysis on the remaining texts.
• Identify the basic time expression, location and organization entities using the shareware ICTCLAS 2010 (can be downloaded from http://hi.baidu.com/drkevinzhang).

Sample lexical analysis output:
12月/t 6日/t ,/wd 车辆/n 驶上/v 聚/v (/wkz 源/ng )/wky 青/a (/wkz 城山/ns )/wky 路/n 。/wj 当日/t ,/wd 四川省/ns 都江堰市/ns 聚源镇/nsi 至/p 青城/ns 山/n 道路/n 通车/vi 。/wj 该/rz 路/n 是/vshi 都江堰/ns 启动/v 灾/n 后/f 重建/v 首/m 个/q 新建/v 道路/n 项目/n ,/wd 道路/n 全长/n 10.75/m 公里/q ,/wd 全线/n 设计/vn 时速/n 60/m 公里/小时/n ,/wd 采用/v 一级/b 公路/n 标准/n ,/wd 双向/b 6/m 车道/n ,/wd 并/cc 按/p 8/m 级/q 抗震/vn 设防/vn 。/wj 新华社/nt 记者/n 刘海/nr 摄/vg
(On Dec. 6, vehicles drive onto Ju(yuan)-Qing(chengshan) Road. That day, the road from Juyuan Town, Dujiangyan City, Sichuan Province to Qingcheng Mountain opened to traffic. It is the first new road project in Dujiangyan's post-disaster reconstruction: 10.75 km long, designed for 60 km/h, built to first-class highway standard with six lanes in both directions and grade-8 seismic fortification. Photo by Xinhua reporter Liu Hai.)

Recognition of full entity name
• Typical Chinese lexical analyzers usually produce tokens with small granularity.
• Time expressions are recognized using a regular grammar.
  - “12月/t 6日/t” is merged into “12月6日” (Dec. 6).
• POI location and organization entities are assembled from sequential tokens using heuristic knowledge (such as maximal match).
  - 四川省/ns 都江堰市/ns 聚源镇/nsi 至/p 青城/ns 山/n 道路/n
• Error filtering algorithm
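
The merging of small-granularity tokens into full entity names can be illustrated with a short Python sketch. It is a simplified stand-in for the heuristics mentioned above, not the talk's algorithm: the helper names, the `TIME` and `LOC` labels and the choice of `ns`/`nsi` as location tags are assumptions based only on the sample output.

```python
import re

# Tags assumed to mark location-like tokens, taken from the sample output;
# the full tag set is not given in the slides.
LOCATION_TAGS = {"ns", "nsi"}

# A small regular pattern for month-day expressions such as "12月6日".
TIME_PATTERN = re.compile(r"\d{1,2}月\d{1,2}日")


def parse_tagged(text):
    """Split 'word/tag word/tag ...' into (word, tag) pairs."""
    # rsplit keeps words that contain a slash themselves, e.g. 公里/小时/n.
    return [tuple(tok.rsplit("/", 1)) for tok in text.split()]


def merge_runs(tokens, tags, merged_tag):
    """Concatenate each maximal run of adjacent tokens whose tag is in `tags`."""
    out, run = [], []
    for word, tag in tokens:
        if tag in tags:
            run.append(word)
            continue
        if run:
            out.append(("".join(run), merged_tag))
            run = []
        out.append((word, tag))
    if run:
        out.append(("".join(run), merged_tag))
    return out


sample = "12月/t 6日/t ,/wd 四川省/ns 都江堰市/ns 聚源镇/nsi 至/p 青城/ns 山/n 道路/n"
tokens = merge_runs(parse_tagged(sample), {"t"}, "TIME")
tokens = merge_runs(tokens, LOCATION_TAGS, "LOC")
print(tokens[0], TIME_PATTERN.fullmatch(tokens[0][0]) is not None)
print(tokens[2])
```

Merging runs by tag alone joins 四川省, 都江堰市 and 聚源镇 into one location but still leaves 青城/ns and 山/n apart, which is why additional heuristics and an error filtering step are needed.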

POI extraction modeling
• Measure(f, a) = log(1 + 1/(Distance(f, a) + α)) + log(β + TF_f) + log(β + TF_a)
• a is a POI attribute, such as a POI location, organization or time expression, recognized by the lexical analysis.
• f is the feature word of a given POI event.
• Distance(f, a) is the number of words between a and f.
• TF_f and TF_a are the frequencies of the feature word f and the POI attribute a, respectively.
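
As a worked illustration, the Measure formula can be written directly in Python. The values α = 1 and β = 1 are placeholders, since the slides do not state the parameter values, and picking the highest-scoring candidate is an assumption about how the score is used; the candidate data is invented for the example.

```python
import math


def measure(distance, tf_f, tf_a, alpha=1.0, beta=1.0):
    """Measure(f, a) as defined on the slide.

    distance: number of words between feature word f and attribute a
    tf_f, tf_a: frequencies of f and a in the document
    alpha, beta: parameters; the defaults here are placeholders
    """
    return (math.log(1 + 1 / (distance + alpha))
            + math.log(beta + tf_f)
            + math.log(beta + tf_a))


def best_attribute(tf_f, candidates, alpha=1.0, beta=1.0):
    """Pick the candidate with the highest score (assumed usage).

    candidates: list of (text, distance_to_feature_word, term_frequency)
    """
    return max(candidates,
               key=lambda c: measure(c[1], tf_f, c[2], alpha, beta))


# Invented candidates: equal frequency, different distance to the feature word.
candidates = [("重庆", 2, 1), ("成都", 9, 1)]
print(best_attribute(1, candidates)[0])  # 重庆
```

With equal frequencies the nearer attribute scores higher, because the first term decreases as Distance(f, a) grows; the two frequency terms let a frequently mentioned attribute outweigh a slightly closer but rare one.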

POI extraction modeling
• POI event feature word: a word that hints at a POI event.

POI extraction modeling
• The model uses only term frequency and the distance from the POI location, organization or time expression to the given feature word.
• It is independent of the domain.

Result Optimization
• First, after temporal inference, outdated POI events are judged useless and filtered out.
• Second, consistency and validity checks are used to filter out invalid POIs. In our work, any record located outside mainland China or lacking a specific organization is discarded.

Experiments
• A web crawler queries Google, Baidu and Sogou with POI event feature words to find candidate news articles.
• 2,000 news articles are used as the training collection and 1,000 as the test collection.
• Three experts provide the reference answers.

Experiments
• Precision (P) and recall (R):
  - P = |C∩R| / |R|   (3)
  - R = |C∩R| / |C|   (4)
  - On the right-hand sides, R is the set of results returned by our system, and C is the set of manually tagged correct results.
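
Equations (3) and (4) translate directly into set operations. The following Python sketch uses invented result identifiers purely to show the arithmetic; returning 0.0 for an empty set is a convention chosen here, not something stated in the slides.

```python
def precision_recall(returned, correct):
    """Compute precision and recall from two sets of results.

    returned: set of results produced by the system (the set R)
    correct:  set of manually tagged correct results (the set C)
    """
    hits = len(returned & correct)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(correct) if correct else 0.0
    return precision, recall


# Invented identifiers: 4 results returned, 3 correct answers, 2 in common.
returned = {"poi-1", "poi-2", "poi-3", "poi-4"}
correct = {"poi-1", "poi-2", "poi-5"}
p, r = precision_recall(returned, correct)
print(p, r)
```

Here precision is 2/4 and recall is 2/3, since two of the four returned results appear among the three correct ones.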

Experiments
• BASELINE design
  - Information extraction results are hard to compare unless the experiments address the same task on a given test set.
  - “Co-reference evaluation”, with the goal of stimulating more fundamental NLP research.
• Therefore, the BASELINE experiment is designed with only lexical analysis and co-occurrence within one sentence.
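
A sentence-level co-occurrence baseline of this kind might look like the Python sketch below. It is an assumption about the general idea, not the BASELINE actually evaluated: the feature word 开业 (open for business), the entity list and the second sample sentence are illustrative, and the entities are supplied by hand instead of coming from a lexical analyzer.

```python
import re


def baseline_pairs(text, feature_words, entities):
    """Pair each feature word with every entity found in the same sentence."""
    pairs = []
    # Split on Chinese and ASCII sentence-final punctuation.
    for sentence in re.split(r"[。!?!?]", text):
        feats = [f for f in feature_words if f in sentence]
        ents = [e for e in entities if e in sentence]
        pairs.extend((f, e) for f in feats for e in ents)
    return pairs


text = "成都银行重庆分行本月底在渝开业。该路是都江堰首个新建道路项目。"
pairs = baseline_pairs(text, ["开业"], ["成都银行重庆分行", "渝", "都江堰"])
print(pairs)
```

Only entities in the first sentence are paired with the feature word; 都江堰 is skipped because its sentence contains no feature word. Such a baseline ignores distance and frequency, which are exactly the signals the proposed model adds.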

Experimental Result

Analysis
• Compared with BASELINE, the proposed POI extraction method achieves better performance in terms of both precision and recall.
• This indicates that each NLP module solves a different problem and improves POI extraction performance; location optimization is the most effective.

Compared with previous works
• Structured approach
  - Web pages with a predefined and strict format, such as B2B or library book pages.
  - Extraction with predefined templates.
• Unstructured approach
  - Free text in natural language, such as news articles and technical reports.
  - More dependent on natural language processing within a restricted domain.
• Semi-structured approach
  - Intermediate structure, such as product introduction pages or academic papers.
  - Uses heuristic-based wrappers on structured data and natural language techniques on text.

Compared with previous works
• Separate extractor with a knowledge base.
• The unstructured extractor uses only shallow parsing of natural language, such as entity recognition.
• The POI extraction model is based only on term frequency and distance.

Demo

Conclusion
• The major objective of our work is to find POI data effectively in unstructured news text.
• We propose a novel POI extraction model that incorporates full POI entity recognition, distance and frequency to measure the dependency between a POI event and its attributes.
• The experimental results show that our model achieves a promising improvement over the baseline technique.
• Our work has been applied in industrial production.

Thank you
Contact email: kevinzhang@bit.edu.cn
Welcome to visit my blog: http://hi.baidu.com/drkevinzhang/