Mining Local Gazetteers of Literary Chinese with CRF

Mining Local Gazetteers of Literary Chinese with CRF and Pattern based Methods for Biographical Information in Chinese History
Chao-Lin Liu, Chih-Kai Huang, Hongsu Wang, Peter K. Bol
National Chengchi University, Taiwan
Harvard University, USA
29 October 2015

Goals
  • Enhance the CBDB contents by mining and discovering more biographical information from multiple sources with computing methods
  • Readily extensible to social network analysis
  • Difangzhi for now

Outline
  • China Biographical Database
  • Local gazetteers of China
  • Language models
  • Conditional-random-field models
  • Discussions

China Biographical Database
  • CBDB (中國歷代人物傳記資料庫)
  • URL: http://isites.harvard.edu/icb.do?keyword=k16229&pageid=icb.page76535
  • Short URL: http://goo.gl/hCUKpR
  • Open and free database of about 360,000 Chinese individuals, ranging from the 7th to the 19th century

CBDB Query

CBDB Query (Cont’d)

CBDB Query Results

CBDB Query Results (Cont’d)

Difangzhi 地方志
  • Local gazetteers compiled by Chinese governments
  • A big collection of biographical information (and others) since the 6th century

Word & Paragraph Boundaries & Punctuations
ifyouhadascreenplaytosellorbetteryetthenextbigtec
hstartuptopitchthewhitehouseseastroomwasthepla
cetobeonfridaynightgatekeepersofthesilverscreena
ndsiliconvalleywereoutinfullforceatthestatedinneri
nhonorofchinesepresidentxijinpinggiantsoftheindus
tryincludingfacebookceomarkzuckerbergandapplec
eotimcookrubbedelbowswithrobertigerceoofthewa
ltdisneycompanyandjeffreykatzenbergceoofdream
worksanimationandwithallthecorporatetitansinatte
ndanceyoudthinkthenightwouldbeallbusinessbutwh
enaskedtopredictthebiggestitemontheeveningsage
ndakatzenbergsaidfunihope
SOURCE: Washington Post: https://www.washingtonpost.com/news/reliable-source/wp/2015/09/25/state-dinner-recap-heavy-on-silicon-valley-and-the-silver-screen/

WPBP (when recovered)
If you had a screenplay to sell, or better yet the next big tech start-up to pitch, the White House’s East Room was the place to be on Friday night. Gatekeepers of the silver screen and Silicon Valley were out in full force at the state dinner in honor of Chinese President Xi Jinping. Giants of the industry, including Facebook CEO Mark Zuckerberg and Apple CEO Tim Cook, rubbed elbows with Robert Iger, CEO of the Walt Disney Company, and Jeffrey Katzenberg, CEO of DreamWorks Animation.
And with all the corporate titans in attendance you’d think the night would be all business. But when asked to predict the biggest item on the evening’s agenda Katzenberg said, “Fun. I hope.”
SOURCE: Washington Post: https://www.washingtonpost.com/news/reliable-source/wp/2015/09/25/state-dinner-recap-heavy-on-silicon-valley-and-the-silver-screen/

CNGRAM Procedure

Annotations with CBDB info
陳瑜字仲庸雷州人廣西中書省都事

CNGRAM Procedure

Language-Model n-grams
  • Examples
      • <NAME><ADDRESS><REIGN PERIOD><ENTRY>
      • <NAME><ADDRESS><ENTRY><REIGN PERIOD>
      • <NAME><ADDRESS><OFFICE>
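
A minimal sketch of how label-sequence patterns such as <NAME><ADDRESS><OFFICE> might be counted, assuming each record has already been reduced to its sequence of entity labels. The function name and the three toy records are illustrative assumptions; the slides do not show how CNGRAM computes its n-grams.

```python
from collections import Counter

def label_ngrams(labels, n):
    """Return all contiguous n-grams of a label sequence as tuples."""
    return [tuple(labels[i:i + n]) for i in range(len(labels) - n + 1)]

# Hypothetical label sequences, one per annotated record.
records = [
    ["NAME", "ADDRESS", "REIGN PERIOD", "ENTRY"],
    ["NAME", "ADDRESS", "ENTRY", "REIGN PERIOD"],
    ["NAME", "ADDRESS", "OFFICE"],
]

# Count every label trigram across the records.
counts = Counter()
for labels in records:
    counts.update(label_ngrams(labels, 3))

# Patterns sorted from most to least frequent.
top_patterns = counts.most_common()
```

Frequent label n-grams found this way could then serve as candidate extraction patterns.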

Extracting Annotated Strings
<name><address><office>
陳瑜字仲庸雷州人廣西中書省都事
<name>陳瑜</name>字仲庸<addr>雷州</addr>人<addr>廣西</addr><office>中書省都事</office>
陳瑜字仲庸雷州人廣西中書
<name>陳瑜</name>字仲庸<addr>雷州</addr>人<addr>廣西</addr><office>中書</office>
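
The annotation step can be sketched as a dictionary lookup that prefers the longest match at each position. This is only an assumption about how the tagging might work: the `annotate` function and the tiny lexicon below are hypothetical stand-ins for lookups against CBDB name, address, and office tables.

```python
def annotate(text, lexicon):
    """Tag dictionary entries in text, preferring the longest match at each position."""
    max_len = max(len(term) for term in lexicon)
    out, i = [], 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in lexicon:
                tag = lexicon[piece]
                out.append(f"<{tag}>{piece}</{tag}>")
                i += size
                break
        else:
            # No dictionary entry starts here: copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)

# Hypothetical lexicon: term -> tag.
LEXICON = {
    "陳瑜": "name",
    "雷州": "addr",
    "廣西": "addr",
    "中書省都事": "office",
    "中書": "office",
}

full = annotate("陳瑜字仲庸雷州人廣西中書省都事", LEXICON)
short = annotate("陳瑜字仲庸雷州人廣西中書", LEXICON)
```

With this lexicon, the longer office title wins over 中書 whenever the full title is present, which mirrors the two annotated strings on the slide.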

Extraction & Filter Patterns
<name> 字 Z1 Z2 <address>
陳瑜字仲庸雷州人廣西中書省都事
<name>陳瑜</name>字仲庸<addr>雷州</addr>人<addr>廣西</addr><office>中書省都事</office>
陳瑜(Yuan) 雷州 廣西 中書省都事(Yuan)
Yuan, 陳瑜, 仲庸

Extraction & Filter Patterns
<name> 字 Z1 Z2 <address>
陳瑜字仲庸雷州人廣西中書
<name>陳瑜</name>字仲庸<addr>雷州</addr>人<addr>廣西</addr><office>中書</office>
陳瑜(Yuan) 雷州 廣西 中書(Yuan)
Yuan, 陳瑜, 仲庸
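
A minimal sketch of the pattern "<name> 字 Z1 Z2 <address>" as a regular expression over an annotated string, followed by one possible reading of the filter step: keep the record only when the dynasty recorded for the name agrees with the dynasty recorded for the office. The two lookup tables and the `extract_record` function are hypothetical stand-ins for CBDB queries, not the authors' implementation.

```python
import re

# <name>NAME</name> 字 Z1 Z2 <addr>: the two characters after 字 are the style name.
NAME_STYLE = re.compile(r"<name>(?P<name>[^<]+)</name>字(?P<style>[^<]{2})<addr>")
OFFICE = re.compile(r"<office>(?P<office>[^<]+)</office>")

# Hypothetical dynasty lookups standing in for CBDB queries.
NAME_DYNASTY = {"陳瑜": "Yuan"}
OFFICE_DYNASTY = {"中書省都事": "Yuan", "中書": "Yuan"}

def extract_record(annotated):
    """Return (dynasty, name, style name), or None if the pattern or filter fails."""
    m = NAME_STYLE.search(annotated)
    o = OFFICE.search(annotated)
    if not m or not o:
        return None
    dynasty = NAME_DYNASTY.get(m.group("name"))
    # Filter: the name and the office must point to the same dynasty.
    if dynasty is None or dynasty != OFFICE_DYNASTY.get(o.group("office")):
        return None
    return (dynasty, m.group("name"), m.group("style"))

annotated = ("<name>陳瑜</name>字仲庸<addr>雷州</addr>人"
             "<addr>廣西</addr><office>中書省都事</office>")
record = extract_record(annotated)
```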

Evaluation: Results (1 Oct. 2015)
Type  Dynasty  Name  Style Name  Quan.  Prop.
1     ○        ○     ○           609    44.6%
2     ○        ○     ×           665    43.2%
3     ×        ○     ○           117    3.17%
4     ○        ×     ○           262    2.46%
5     ×        ○     ×           220    2.30%
6     ×        ×     ○           45     1.59%
7     ○/×      ×     ×           234    10.87%

3765 Records
Type  Name  Style Name  Quan.  Prop.
1     ○     ○           1192   31.66%
2     ○     ×           885    23.51%
3     ×     ○           1104   29.32%
4     ×     ×           584    15.51%

Self Comparison
Type  Dynasty  Name  Style Name  Quan.  Prop.
1     ○        ○     ○           609    44.6%
2     ○        ○     ×           665    43.2%
3     ×        ○     ○           117    3.17%
4     ○        ×     ○           262    2.46%
5     ×        ○     ×           220    2.30%
6     ×        ×     ○           45     1.59%
7     ○/×      ×     ×           234    10.87%

Type  Name  Style Name  Quan.  Prop.
1     ○     ○           1192   31.66%
2     ○     ×           885    23.51%
3     ×     ○           1104   29.32%
4     ×     ×           584    15.51%

Conditional Random Field Models
  • A machine learning approach
  • Goal: predicting the class for a character
  • Given: the character itself and the labels (features) for surrounding characters
  • Data
      • Training data: 110,000 records extracted from the gazetteers with regular expressions (1.498 million characters)
      • Test data: unlabeled raw gazetteer texts (900 thousand characters)

CRF Models (Cont’d)
  • Tool: MALLET CRF (Univ. of Massachusetts)
  • Classes
      • NB for name begin; NI for name interior; NE for name end
      • AB for address begin; AI for address interior; AE for address end
      • O for others
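
The class scheme above can be illustrated by converting an annotated string into one label per character. This is a sketch under assumptions: the tag names follow the annotated strings shown earlier, the `char_labels` function is hypothetical, and labelling a single-character entity as "begin" is a choice made here, not something the slides specify.

```python
import re

# Only names and addresses have their own classes; everything else is O.
TAG_PREFIX = {"name": "N", "addr": "A"}
# Either a tagged span, or a single untagged character.
TOKEN = re.compile(r"<(\w+)>([^<]+)</\1>|([^<])")

def char_labels(annotated):
    """Return a list of (character, class) pairs using NB/NI/NE, AB/AI/AE, and O."""
    pairs = []
    for m in TOKEN.finditer(annotated):
        tag, span, single = m.group(1), m.group(2), m.group(3)
        if single is not None:
            pairs.append((single, "O"))
            continue
        prefix = TAG_PREFIX.get(tag)
        for k, ch in enumerate(span):
            if prefix is None:
                label = "O"               # e.g. office spans are not a target class
            elif k == 0:
                label = prefix + "B"      # begin (also used for one-character spans)
            elif k == len(span) - 1:
                label = prefix + "E"      # end
            else:
                label = prefix + "I"      # interior
            pairs.append((ch, label))
    return pairs

annotated = ("<name>陳瑜</name>字仲庸<addr>雷州</addr>人"
             "<addr>廣西</addr><office>中書省都事</office>")
labels = [label for _, label in char_labels(annotated)]
```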

Features for CRF models
Group  Types                                          Description
1      Chinese characters                             self
2      Chinese characters                             surrounding k characters
3      relative positions of selected named entities  office, entry, reign period, and time
4      usage                                          used in person or location name
5      usage                                          family name?
6      named entities                                 office, entry, reign period, and time

Feature Values for CRF Models
陳瑜字仲庸雷州人廣西中書省都事
Group  Types                                          Description                             Feature values
1      Chinese characters                             self                                    州
2      Chinese characters                             surrounding k characters                瑜, 字, 仲, 庸, 雷, 人, 廣, 西, 中, 書
3      relative positions of selected named entities  office, entry, reign period, and time   office.Right@3
4      usage                                          used in person or location name         A probability (discretized)
5      usage                                          family name?                            No
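
Groups 1 and 2 can be sketched as a small feature extractor. The surrounding characters listed for 州 correspond to a window of k = 5 on each side; the feature-key naming (`char[-1]`, `char[+1]`, and so on) and the `char_features` function are assumptions for illustration and do not reflect MALLET's input format.

```python
def char_features(text, i, k=5):
    """Group 1 (the character itself) and Group 2 (up to k characters on each side)."""
    feats = {"self": text[i]}
    for offset in range(-k, k + 1):
        j = i + offset
        # Skip the character itself and positions outside the string.
        if offset != 0 and 0 <= j < len(text):
            feats[f"char[{offset:+d}]"] = text[j]
    return feats

text = "陳瑜字仲庸雷州人廣西中書省都事"
feats = char_features(text, text.index("州"))
window = [feats[f"char[{o:+d}]"] for o in (-5, -4, -3, -2, -1, 1, 2, 3, 4, 5)]
```

Characters near the start or end of a record simply get fewer context features, since positions outside the string are skipped.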

Evaluation (individual character)
  • 5-fold cross validation
Classes: O, NB, NI, NE, AB, AI, AE
Measures: Prec., Recall, F1
Group 1+2: 0.97 0.94 0.95 0.85 0.94 0.89 0.86 0.91 0.88 0.82 0.92 0.87 0.85 0.86 0.71 0.84 0.77 0.85 0.86
Group 1+2+4+5+6: 0.97 0.93 0.95 0.94 0.93 0.91 0.93 0.92 0.91 0.89 0.90 0.83 0.89 0.86 0.91 0.89 0.90
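
For reference, per-class precision, recall, and F1 over character labels can be computed as in the sketch below. The `per_class_prf` function and the toy gold and predicted sequences are illustrative assumptions; the slides do not show the evaluation code.

```python
from collections import Counter

def per_class_prf(gold, pred):
    """Return {label: (precision, recall, F1)} for two equal-length label sequences."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1   # predicted label p was wrong here
            fn[g] += 1   # gold label g was missed here
    scores = {}
    for label in set(gold) | set(pred):
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        prec = tp[label] / p_den if p_den else 0.0
        rec = tp[label] / r_den if r_den else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[label] = (prec, rec, f1)
    return scores

# Toy example, not data from the study.
gold = ["NB", "NE", "O", "O", "AB", "AE"]
pred = ["NB", "NE", "O", "AB", "AB", "O"]
scores = per_class_prf(gold, pred)
```

The same F1 formula is consistent with the first triple reported for Group 1+2: with precision 0.97 and recall 0.94, 2 × 0.97 × 0.94 / (0.97 + 0.94) ≈ 0.95.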

Evaluation (17914 names)
  • 1800 instances in all zones except the last
  • Scores: checking the first 100 samples in each zone
Zone  Correct  Expt’d
1     97       1746
2     88       1584
3     90       1620
4     81       1458
5     79       1422
6     70       1260
7     77       1386
8     69       1242
9     59       1062
10    59       1011
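
The expected counts appear to follow from scaling each zone's sample accuracy to the zone size. The sketch below assumes that reading (correct out of 100, times the number of instances in the zone, rounded down); the formula is inferred from the numbers and is not stated on the slide.

```python
TOTAL_NAMES = 17914
ZONE_SIZE = 1800
NUM_ZONES = 10
SAMPLE = 100

# Correct names among the first 100 samples of each zone (from the slide).
correct = [97, 88, 90, 81, 79, 70, 77, 69, 59, 59]

# Nine full zones; the last zone holds the remainder.
sizes = [ZONE_SIZE] * (NUM_ZONES - 1)
sizes.append(TOTAL_NAMES - sum(sizes))

# Expected number of correct names per zone, rounded down.
expected = [c * size // SAMPLE for c, size in zip(correct, sizes)]

total_expected = sum(expected)
overall = total_expected / TOTAL_NAMES
```

Under this assumption the last zone contains 1714 names, the per-zone values match the "Expt’d" column, and the expected number of correct names overall is 13791 of 17914, roughly 77%.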

Names and Addresses

Paragraph Boundary Identification
  • Many paragraphs start with person names and other “signals”
  • Paragraph boundary identification is possible
  • Identifying paragraph boundaries is tantamount to finding the “owners” of the paragraphs
  • This in turn leads to building social networks among persons

Illustrations
  ○ C1 C2 字 Z1 Z2
  ○ C1 C2 C3 字 Z1 Z2
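
The two marker shapes above (a two- or three-character name, then 字, then a two-character style name) can be sketched as a regular expression that proposes candidate paragraph starts. Everything beyond the two shapes is an assumption: the surname check used to discard spurious candidates, the `paragraph_starts` function, and the sample text, in which 王大有 and 子謙 are made-up names.

```python
import re

CJK = r"[\u4e00-\u9fff]"
# Lookahead so that every position in the text is tested as a possible start.
MARKER = re.compile(rf"(?=({CJK}{{2,3}})字({CJK}{{2}}))")

def paragraph_starts(text, surnames):
    """Return (offset, name, style name) for each candidate paragraph start."""
    starts = []
    for m in MARKER.finditer(text):
        name, style = m.group(1), m.group(2)
        # Keep only candidates whose first character is a known surname.
        if name[0] in surnames:
            starts.append((m.start(), name, style))
    return starts

# Hypothetical surname list and sample text.
SURNAMES = {"陳", "王"}
sample = "陳瑜字仲庸雷州人王大有字子謙廣西人"
starts = paragraph_starts(sample, SURNAMES)
```

Without the surname check, the pattern would also fire one character into a three-character name (大有 followed by 字), which is why some external evidence about names is needed.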

Current Results
  • 56.3% using any consecutive markers
  • 73.0%, if the markers are correct

Concluding Remarks
  • Language-model based methods and machine learning methods are useful for extracting biographical information from literary Chinese
  • Yet, the results are not perfect
  • Will extend the current work for mining social networks in historical documents

Thank you & Questions