PAT-Tree-Based Adaptive Keyphrase Extraction for Intelligent Chinese Information Retrieval

PAT-Tree-Based Adaptive Keyphrase Extraction for Intelligent Chinese Information Retrieval • Source: Institute of Information Science, Academia Sinica, Taipei, Taiwan, R.O.C. • Students: 陳道輝, 周鉦琪, 葉飛 • Advisor: Prof. 黃三益

Abstract • PAT-tree-based adaptive approach to keyphrase extraction • IR applications: automatic term suggestion, domain-specific lexicon construction, book indexing, and document classification

Introduction • Keyphrase (keyword) extraction in Chinese is a critical problem because word segmentation and unknown-word identification are difficult (e.g., 哈電族)

Definition of the Problems • Lexical pattern (LP): a string that consists of more than one successive character and occurs a certain number of times in a text collection from a specific domain • For example: 關鍵詞抽取 • LPs: 關鍵、鍵詞、詞抽、抽取、關鍵詞、鍵詞抽、詞抽取、關鍵詞抽、鍵詞抽取、關鍵詞抽取
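The LP list above is every contiguous substring of 關鍵詞抽取 with at least two characters. A minimal sketch of that enumeration follows; the function name `lexical_pattern_candidates` is hypothetical, and the sketch ignores the occurrence-count condition in the definition, which needs a real text collection.

```python
def lexical_pattern_candidates(text: str, min_len: int = 2) -> list[str]:
    """Return every contiguous substring of `text` with at least `min_len` characters."""
    n = len(text)
    return [
        text[i:i + k]
        for k in range(min_len, n + 1)   # substring length
        for i in range(n - k + 1)        # start position
    ]


candidates = lexical_pattern_candidates("關鍵詞抽取")
print(len(candidates))
print("、".join(candidates))
```

For a string of n characters this yields n(n-1)/2 candidates, which is why the later modules need an index that can look up the frequency of any substring quickly.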

Definition of the Problems (cont) • Complete lexical pattern (CLP): an LP that has a complete meaning and semantically valid lexical boundaries • For example: 關鍵詞抽取 • CLPs: 關鍵、抽取、關鍵詞抽取

Definition of the Problems (cont) • Significant lexical pattern (SLP): a CLP that is either “specific” or “significant” in the database • For example: 關鍵詞抽取 • SLPs: 關鍵詞、關鍵詞抽取

Definition of the Problems (cont) • Definition 1: SLP Extraction Problem • Definition 2: CLP Estimation Problem • To solve problem 1, first we should solve problem 2

Definition of the Problems (cont) • Proposed Approach: 3 modules – Text analysis and PAT-tree indexing module – CLP extraction module – SLP extraction module

Definition of the Problems (cont)

Estimation of CLP • Most CLPs have strong associations between their composed and overlapping substrings • Association Norm Estimation (AE) function • If AE is large, patterns y and z occur together in the text collection in many cases (e.g., 關鍵詞抽取 and 關鍵詞抽)
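The AE formula itself is not reproduced in the text of this slide. The sketch below therefore assumes one common form of the association norm, AE(x) = f(x) / (f(y) + f(z) - f(x)), where y is x without its last character and z is x without its first character. That form, the helper names, and the toy corpus are all assumptions made for illustration.

```python
from collections import Counter


def substring_counts(segments, max_len=8):
    """Count every substring of up to `max_len` characters in each segment."""
    counts = Counter()
    for seg in segments:
        for i in range(len(seg)):
            for j in range(i + 1, min(len(seg), i + max_len) + 1):
                counts[seg[i:j]] += 1
    return counts


def association_estimate(x, counts):
    """Assumed form: AE(x) = f(x) / (f(y) + f(z) - f(x)), y = x[:-1], z = x[1:]."""
    if len(x) < 2:
        raise ValueError("x must contain at least two characters")
    fx = counts[x]
    denominator = counts[x[:-1]] + counts[x[1:]] - fx
    return fx / denominator if denominator > 0 else 0.0


# Toy corpus (hypothetical), already split into segments.
counts = substring_counts(["關鍵詞抽取", "關鍵詞抽取", "關鍵詞檢索"])
print(association_estimate("關鍵詞抽取", counts))  # y and z never occur without x
print(association_estimate("鍵詞抽", counts))      # 鍵詞 also occurs without 抽
```

Under this assumed form, AE reaches 1 only when y and z never occur apart from x, which matches the slide's remark that a large AE means y and z usually occur together.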

Estimation of CLP (cont) • AE alone is not enough to check whether x has complete lexical boundaries (e.g., 關鍵詞) • To overcome this, we use two additional metrics: LCD (left context dependency) and RCD (right context dependency), e.g., 李登輝 • With these metrics we can say: – x is a CLP iff it has neither LCD nor RCD, and AE > $t_3$ (a threshold)

Estimation of CLP (cont) • x has LCD if $|L| < t_1$ or $\max_{z \in L} f(zx)/f(x) > t_2$, where $t_1$ and $t_2$ are threshold values and $L$ is the set of unique left-adjacent characters of x • x has RCD if $|R| < t_1$ or $\max_{y \in R} f(xy)/f(x) > t_2$, where $t_1$ and $t_2$ are threshold values and $R$ is the set of unique right-adjacent characters of x
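A minimal sketch of the two dependency checks follows, applied literally as stated above. The function names, the toy segments, and the thresholds `t1 = 3` and `t2 = 0.5` are hypothetical; real thresholds would have to be tuned on a corpus. Note that under the literal rule a pattern with no neighbours on one side counts as dependent on that side.

```python
from collections import Counter


def adjacent_characters(segments, x):
    """Return f(x) plus counters of the characters directly left and right of x."""
    fx, left, right = 0, Counter(), Counter()
    for seg in segments:
        start = seg.find(x)
        while start != -1:
            fx += 1
            end = start + len(x)
            if start > 0:
                left[seg[start - 1]] += 1    # contributes to f(zx)
            if end < len(seg):
                right[seg[end]] += 1         # contributes to f(xy)
            start = seg.find(x, start + 1)
    return fx, left, right


def is_dependent(fx, neighbours, t1, t2):
    """True if too few distinct neighbours, or one neighbour dominates."""
    if len(neighbours) < t1:
        return True
    top = max(neighbours.values(), default=0)
    return fx > 0 and top / fx > t2


def dependencies(segments, x, t1=3, t2=0.5):
    """Return the pair (has LCD, has RCD) for pattern x."""
    fx, left, right = adjacent_characters(segments, x)
    return is_dependent(fx, left, t1, t2), is_dependent(fx, right, t1, t2)


segments = ["總統李登輝表示", "由李登輝主持", "對李登輝而言", "前李登輝說"]
for pattern in ("李登輝", "李登", "登輝"):
    print(pattern, dependencies(segments, pattern))
```

In this toy data 李登 is always followed by 輝 and 登輝 is always preceded by 李, so each fragment is flagged on one side, while the full name 李登輝 has varied neighbours on both sides and passes both checks.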

Text Analysis and PAT-Tree Indexing • The PAT tree is the primary implementation structure, used for both text retrieval and keyphrase extraction • Use delimiters (, “ ” .) to determine segment boundaries, then build the semi-infinite strings of each segment • For example: 個人電腦, 人腦 – 個人電腦, 人電腦, 電腦, 腦, 人腦, 腦 • Node information: comparison bit, external nodes, frequency • The PAT tree makes prefix search easy • The inverse PAT tree (IPAT) makes suffix search easy
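The segmentation and semi-infinite string steps above can be sketched as follows. The delimiter set and the function names are assumptions for illustration; a real system would use whatever punctuation inventory suits its corpus.

```python
import re

# Assumed delimiter set: common Chinese and ASCII punctuation plus space.
DELIMITERS = "，,。.「」“” "
_SPLIT = re.compile("[" + re.escape(DELIMITERS) + "]+")


def split_segments(text):
    """Split text at delimiters and drop empty pieces."""
    return [piece for piece in _SPLIT.split(text) if piece]


def semi_infinite_strings(segment):
    """Return every suffix of the segment, one per starting character."""
    return [segment[i:] for i in range(len(segment))]


sistrings = [
    s
    for segment in split_segments("個人電腦，人腦")
    for s in semi_infinite_strings(segment)
]
print(sistrings)
```

Every substring of a segment is a prefix of one of these semi-infinite strings, which is why a structure that is good at prefix search can answer frequency queries for any candidate pattern.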

Text Analysis and PAT-Tree Indexing (cont) • Convert the semi-infinite strings to bit sequences • Build the PAT tree from these bit sequences and the bit positions at which they differ • Also build an inverse PAT tree over the reversed data streams of the database to check the occurrences of LSs and RSs • For example: 詞鍵關、詞鍵、詞鍵關展發、詞鍵關行進
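The reversed strings in the example show why an inverse index helps: the characters to the left of a pattern become the characters that follow its reversed form, so a prefix search finds them. The sketch below uses a sorted Python list with binary search as a stand-in for the inverse PAT tree; that substitution and the helper name `prefix_matches` are assumptions, not the structure used in the original work.

```python
from bisect import bisect_left


def prefix_matches(sorted_keys, prefix):
    """Return all keys in a sorted list that start with `prefix`."""
    i = bisect_left(sorted_keys, prefix)
    matches = []
    while i < len(sorted_keys) and sorted_keys[i].startswith(prefix):
        matches.append(sorted_keys[i])
        i += 1
    return matches


phrases = ["關鍵詞", "鍵詞", "發展關鍵詞", "進行關鍵詞"]
inverse_keys = sorted(p[::-1] for p in phrases)   # 詞鍵關, 詞鍵, 詞鍵關展發, ...

pattern = "關鍵詞"
reversed_pattern = pattern[::-1]
hits = prefix_matches(inverse_keys, reversed_pattern)

# The character right after the reversed pattern is the left neighbour
# of the pattern in the original text.
left_neighbours = sorted(
    {key[len(reversed_pattern)] for key in hits if len(key) > len(reversed_pattern)}
)
print(hits)
print(left_neighbours)
```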

Text Analysis and PAT-Tree Indexing (cont) • Why use a PAT tree (Patricia tree)? – The number of key comparisons per lookup is low (logarithmic) – Computing time and space requirements are reduced – Search is efficient – The PAT tree can be used to check RCD – The inverse PAT tree can be used to check LCD

Extraction of SLP • A CLP is not always an SLP – It may not be significant in the text collection – Many CLPs are common in everyday use • All CLPs are checked against a set of lexical rules and a general-domain corpus • Rules: – Numbers, adverbs, timing-related terms – General-domain PAT tree vs. specific-domain PAT tree

Evaluation • Extraction of SLP – Three people were asked to select CLPs and keyphrases from 50 “seed sentences” – These test data were used to measure the accuracy of SLP extraction

| Phrase length | Total number of extracted keyphrases | Number of correct keyphrases extracted | Precision |
|---|---|---|---|
| 2 | 3568 | 3311 | 92.8% |
| 3 | 1130 | 661 | 58.5% |
| 4 | 999 | 687 | 68.77% |
| 5 | 207 | 150 | 72.46% |
| >=6 | 178 | 151 | 84.83% |
| Total | 6082 | 4960 | 81.55% |
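Precision here is the number of correct keyphrases divided by the number extracted. As a quick consistency check, the sketch below recomputes each row and the totals from the counts reported above; only the variable names are new.

```python
# (extracted, correct) per phrase length, copied from the evaluation table.
rows = {
    "2": (3568, 3311),
    "3": (1130, 661),
    "4": (999, 687),
    "5": (207, 150),
    ">=6": (178, 151),
}


def precision(extracted, correct):
    """Precision as a percentage."""
    return 100.0 * correct / extracted


total_extracted = sum(extracted for extracted, _ in rows.values())
total_correct = sum(correct for _, correct in rows.values())
overall = precision(total_extracted, total_correct)

for length, (extracted, correct) in rows.items():
    print(f"{length:>3}: {precision(extracted, correct):.2f}%")
print(f"all: {total_extracted} extracted, {total_correct} correct, {overall:.2f}%")
```

Two-character phrases make up more than half of the extracted set, so their high precision dominates the overall figure; three-character phrases are the weakest group.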

Evaluation (cont) • Speed and Space Requirements

| Corpus | Corpus size (KB) | PAT tree size (KB) | Time to construct PAT tree (sec) | Time to extract keyphrases (sec) |
|---|---|---|---|---|
| C1-O(10k) | 12 | 77 | 0.19 | 0.01 |
| C2-O(100k) | 127 | 670 | 2.82 | 0.02 |
| C3-O(1M) | 1033 | 4687 | 25.52 | 1.62 |
| C4-O(10M) | 10048 | 44312 | 306.32 | 28.51 |
| C5-O(100M) | 107333 | 439087 | 2381 | 283 |

Conclusion • This method reduces the difficulty of keyphrase extraction in Chinese and delivers better performance

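Example: semi-infinite strings for 個人電腦，人腦 (node numbers: 0, 2, 4, 6, 8, 9, 11; bit positions: 1, 9, 17, 25)

| Semi-infinite string | Node | Bit string |
|---|---|---|
| 個人電腦 | 0 | 101011010011 10100100 … |
| 人電腦 | 2 | 101001000 10111001 … |
| 電腦 | 4 | 10111001 01110001 0000 … |
| 腦 | 6 | 10111000 00000000 … |
| 人腦 | 9 | 101001000 0000 … |
| 腦 | 6 | 10111000 00000000 … |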