Automatic Keyword Extraction from Documents Using Conditional Random Fields

Automatic Keyword Extraction from Documents Using Conditional Random Fields
ZHANG Chengzhi (1, 2), WANG Huilin (1)
1. Institute of Scientific & Technical Information of China
2. Department of Information Management, Nanjing University of Science & Technology

Background
• The task of automatic keyword extraction (AKE)
• A large number of documents do not have keywords
• Existing keyword extraction methods cannot use most of the features in the document
• The Conditional Random Fields (CRF) model is a state-of-the-art sequence labeling method
• Keyword extraction is a typical labeling problem

Related Work
• Simple Statistics Approaches
  – N-gram, word frequency, TF*IDF, word co-occurrences, PAT-tree
• Linguistics Approaches
  – lexical analysis, syntactic analysis, discourse analysis
• Machine Learning Approaches
  – Methods: Naïve Bayes, SVM, Bagging
  – Tools: KEA, GenEx
• Other Approaches
  – heuristic knowledge, hybrid methods

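CRF-based Keyword Extraction
Observation sequence: X = (X_1, X_2, …, X_n)
Corresponding state labels: Y = (Y_1, Y_2, …, Y_n)
Conditional probability of the labels given the observations: P(Y|X) (formula not preserved in this transcript)
Decoding: Y* = arg max_Y P(Y|X)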
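
For orientation, a linear-chain CRF in its standard textbook form scores a label sequence by summing weighted feature functions over positions, normalises the exponentiated scores to get P(Y|X), and decodes by taking the highest-probability sequence. The following is a minimal sketch of that idea only: it enumerates all label sequences by brute force, and the words, the three-label subset, and every weight in `EMIT` and `TRANS` are made-up illustration values, not parameters from the slides.

```python
import itertools
import math

LABELS = ["KW_B", "KW_I", "KW_N"]  # illustrative subset of the label set

# Hypothetical weights: (word, label) emission and (prev, current) transition.
EMIT = {("trade", "KW_B"): 2.0, ("integration", "KW_I"): 1.5, ("and", "KW_N"): 2.0}
TRANS = {("KW_B", "KW_I"): 1.0}
WORDS = ["trade", "integration", "and"]


def score(words, labels, emit, trans):
    """Unnormalised log-score: sum of emission and transition weights."""
    total = 0.0
    for i, (word, label) in enumerate(zip(words, labels)):
        total += emit.get((word, label), 0.0)
        if i > 0:
            total += trans.get((labels[i - 1], label), 0.0)
    return total


def distribution(words, emit, trans):
    """P(Y|X) for every possible label sequence (brute force, toy sizes only)."""
    sequences = list(itertools.product(LABELS, repeat=len(words)))
    scores = [score(words, seq, emit, trans) for seq in sequences]
    log_z = math.log(sum(math.exp(s) for s in scores))  # log of the normaliser Z(X)
    return {seq: math.exp(s - log_z) for seq, s in zip(sequences, scores)}


def decode(words, emit, trans):
    """Y* = arg max_Y P(Y|X)."""
    dist = distribution(words, emit, trans)
    return max(dist, key=dist.get)


DIST = distribution(WORDS, EMIT, TRANS)
BEST = decode(WORDS, EMIT, TRANS)
```

Real CRF toolkits replace the enumeration with dynamic programming (forward algorithm for Z(X), Viterbi for decoding), since the number of label sequences grows exponentially with sequence length.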

Features in the CRF Model
• Example title: 贸易投资一体化与就业增长——以江苏省为案例的实证分析 (Trade and investment integration and employment growth: an empirical analysis with Jiangsu Province as a case)
• After word segmentation and POS tagging: 贸易投资/n 一体化/vn 与/c 就业增长/n ——/w 以/p 江苏省/ns 为/p 案例/n 的/uj 实证分析/n
• After keyword labeling: 贸易投资/KW_B 一体化/KW_I 与/KW_S 就业增长/KW_N ——/KW_S 以/KW_S 江苏省/KW_N 为/KW_S 案例/KW_N 的/KW_S 实证分析/KW_Y
• Labels:
  – KW_B: current word is at the beginning of a keyword
  – KW_I: current word is part of a keyword, but not at its beginning
  – KW_S: current word is a word in the stop list
  – KW_N: current word is neither a keyword nor a word in the stop list
  – KW_Y: current word is a keyword
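
The labeling scheme can be sketched as a small function that turns a segmented title plus a keyword list and a stop list into one label per word. This is an illustration under assumptions, not the authors' code: it reads KW_S as marking stop-list words (which is what the example labeling shows), and the `KEYWORDS` and `STOP_LIST` values below are inferred from the labeled example rather than given on the slide.

```python
def label_words(words, keywords, stop_list):
    """Assign one of KW_B / KW_I / KW_Y / KW_S / KW_N to each segmented word.

    `keywords` is a list of keywords, each given as a tuple of segmented words.
    """
    # Default: stop-list words get KW_S, everything else KW_N.
    labels = ["KW_S" if word in stop_list else "KW_N" for word in words]
    for keyword in keywords:
        n = len(keyword)
        for i in range(len(words) - n + 1):
            if tuple(words[i:i + n]) != tuple(keyword):
                continue
            if n == 1:
                labels[i] = "KW_Y"      # a single word that is itself a keyword
            else:
                labels[i] = "KW_B"      # first word of a multi-word keyword
                for j in range(i + 1, i + n):
                    labels[j] = "KW_I"  # remaining words of that keyword
    return labels


# Assumed inputs, inferred from the labeled example on the slide.
WORDS = ["贸易投资", "一体化", "与", "就业增长", "——", "以",
         "江苏省", "为", "案例", "的", "实证分析"]
KEYWORDS = [("贸易投资", "一体化"), ("实证分析",)]
STOP_LIST = {"与", "——", "以", "为", "的"}

LABELS = label_words(WORDS, KEYWORDS, STOP_LIST)
```

With these inputs the function reproduces the label sequence shown in the example.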

Features in the CRF Model (cont'd)

Training Data for CRF Model

Process of the CRF-based AKE
(1) Preprocessing and feature extraction
(2) CRF model training
(3) CRF labeling and keyword extraction
(4) Results evaluation
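
Step (3) ends by turning a predicted label sequence back into keywords. A minimal sketch of that last part, assuming the KW_B / KW_I / KW_Y scheme from the features slide; the function name, the empty `joiner` (suited to Chinese text), and the handling of malformed sequences are illustrative choices, not details from the slides.

```python
def extract_keywords(words, labels, joiner=""):
    """Collect keywords from a labeled word sequence.

    KW_B opens a multi-word keyword, KW_I extends it, KW_Y is a single-word
    keyword; any other label closes an open keyword.
    """
    keywords, current = [], []

    def flush():
        # Emit the keyword collected so far, if any.
        if current:
            keywords.append(joiner.join(current))
            current.clear()

    for word, label in zip(words, labels):
        if label == "KW_B":
            flush()
            current.append(word)
        elif label == "KW_I" and current:
            current.append(word)
        else:
            # A KW_I with no preceding KW_B also lands here and is ignored.
            flush()
            if label == "KW_Y":
                keywords.append(word)
    flush()
    return keywords


WORDS = ["贸易投资", "一体化", "与", "就业增长", "——", "以",
         "江苏省", "为", "案例", "的", "实证分析"]
LABELS = ["KW_B", "KW_I", "KW_S", "KW_N", "KW_S", "KW_S",
          "KW_N", "KW_S", "KW_N", "KW_S", "KW_Y"]

EXTRACTED = extract_keywords(WORDS, LABELS)
```

For English text, passing `joiner=" "` keeps the words of a multi-word keyword separated.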

Experimental Results--Data Sets
• Documents from http://art.zlzx.org
• 600 academic documents chosen at random
• The number of annotated keywords per document ranges from 5 to 10, with an average of 7.83

Experimental Results--Evaluation Measures (3)
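
The measures themselves are not preserved in this transcript. Keyword extraction is commonly scored with precision, recall, and F1 against the annotated keywords, so the sketch below assumes those three; the function name and the exact-match comparison are illustrative assumptions, and the example keyword lists are invented.

```python
def precision_recall_f1(extracted, annotated):
    """Exact-match precision, recall and F1 for one document."""
    extracted, annotated = set(extracted), set(annotated)
    correct = len(extracted & annotated)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(annotated) if annotated else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented example: 4 extracted keywords, 3 annotated, 2 in common.
P, R, F1 = precision_recall_f1(["a", "b", "c", "d"], ["a", "b", "e"])
```

Here precision is 2/4, recall is 2/3, and F1 is their harmonic mean, 4/7.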

Experimental Results--Other AKE Approaches
• SVM
• multiple linear regression (MLR)
• logistic regression (Logit)
• BaseLine 1: TF*IDF * Len
• BaseLine 2: TF*IDF * Len * DEP
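
A rough sketch of what a TF*IDF * Len baseline could look like. The slides do not define Len or DEP, so this example assumes Len is the word's length in characters, uses a smoothed IDF chosen for illustration, and leaves DEP out entirely; the toy corpus is invented.

```python
import math
from collections import Counter


def tfidf_len_scores(doc_words, corpus):
    """Score each distinct word in `doc_words` by TF * IDF * Len.

    Assumptions: TF is relative frequency in the document, IDF is smoothed as
    log((N + 1) / (df + 1)) + 1, and Len is the character length of the word.
    """
    counts = Counter(doc_words)
    n_docs = len(corpus)
    scores = {}
    for word, count in counts.items():
        df = sum(1 for doc in corpus if word in doc)  # document frequency
        tf = count / len(doc_words)
        idf = math.log((n_docs + 1) / (df + 1)) + 1.0
        scores[word] = tf * idf * len(word)
    return scores


# Invented toy corpus of already-segmented documents.
CORPUS = [
    ["贸易", "增长"],
    ["就业", "增长"],
    ["贸易投资", "就业", "增长"],
]
SCORES = tfidf_len_scores(CORPUS[2], CORPUS)
RANKED = sorted(SCORES, key=SCORES.get, reverse=True)
```

The top-ranked words would then be taken as the extracted keywords; how many to keep is another choice the slides do not specify.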

Experimental Results and Discussions
• Performance Evaluation of Six Models

Experimental Results and Discussions (cont'd)
• Influence of Lexicon Dictionary Size on AKE

Experimental Results and Discussions (cont'd)
• Influence of Training Set Size on AKE

Error Analysis
• (1) Errors in the training set
  – ‘牧民 (herdsman)’ and ‘牧户 (herding household)’
• (2) Ambiguity of the extracted keywords

Conclusion and Future Work
• The CRF model is well suited to the task of AKE
• Future work: apply the AKE approach to Web pages, e-mail, and other non-academic documents

Thank you!
E-mail: zhangchz@istic.ac.cn