Automatic Keyword Extraction from Documents Using Conditional Random Fields
- Slides: 17
Automatic Keyword Extraction from Documents Using Conditional Random Fields
ZHANG Chengzhi (1, 2), WANG Huilin (1)
1. Institute of Scientific & Technical Information of China
2. Department of Information Management, Nanjing University of Science & Technology
Background
• The task of automatic keyword extraction (AKE)
• A large number of documents do not have keywords
• Existing keyword extraction methods cannot use most of the features in a document
• The Conditional Random Fields (CRF) model is a state-of-the-art sequence labeling method
• Keyword extraction is a typical labeling problem
Related Work
• Simple statistics approaches
  – N-gram, word frequency, TF*IDF, word co-occurrence, PAT-tree
• Linguistic approaches
  – lexical analysis, syntactic analysis, discourse analysis
• Machine learning approaches
  – Methods: Naïve Bayes, SVM, Bagging
  – Tools: KEA, GenEx
• Other approaches
  – heuristic knowledge, hybrid methods
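CRF-based Keyword Extraction
Observation sequence: $X = (X_1, X_2, \ldots, X_n)$
Corresponding label sequence: $Y = (Y_1, Y_2, \ldots, Y_n)$
Conditional probability, in the standard linear-chain CRF form (feature functions $f_j$ with weights $\lambda_j$ and normalizer $Z(X)$): $P(Y|X) = \frac{1}{Z(X)} \exp\left( \sum_{i=1}^{n} \sum_{j} \lambda_j f_j(Y_{i-1}, Y_i, X, i) \right)$
Best label sequence: $Y^* = \arg\max_Y P(Y|X)$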
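The search for the best label sequence Y* is typically solved with Viterbi decoding. The following is a minimal sketch, assuming the per-position label scores and the label-to-label transition scores are already available as log-potentials; the function name `viterbi` and the score layout are illustrative and do not come from the slides.

```python
from typing import List


def viterbi(emit: List[List[float]], trans: List[List[float]]) -> List[int]:
    """Return the highest-scoring label sequence (as label indices).

    emit[i][y]  : hypothetical score of label y at position i
    trans[a][b] : hypothetical score of moving from label a to label b
    """
    n, k = len(emit), len(emit[0])
    score = list(emit[0])  # best score of any path ending in each label
    back = []              # back[i - 1][y]: best previous label for y at position i

    for i in range(1, n):
        new_score, pointers = [], []
        for y in range(k):
            # Pick the previous label that maximizes the path score into y.
            best_prev = max(range(k), key=lambda a: score[a] + trans[a][y])
            new_score.append(score[best_prev] + trans[best_prev][y] + emit[i][y])
            pointers.append(best_prev)
        score = new_score
        back.append(pointers)

    # Trace back from the best final label.
    y = max(range(k), key=lambda a: score[a])
    path = [y]
    for pointers in reversed(back):
        y = pointers[y]
        path.append(y)
    path.reverse()
    return path
```

With all transition scores set to zero the result reduces to a per-position argmax; strongly negative off-diagonal transitions push the decoder to keep the same label across positions.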
Features in the CRF Model
• Example title: 贸易投资一体化与就业增长——以江苏省为案例的实证分析 (Trade-Investment Integration and Employment Growth: An Empirical Analysis Using Jiangsu Province as a Case)
• After word segmentation and POS tagging: ‘贸易投资/n 一体化/vn 与/c 就业增长/n ——/w 以/p 江苏省/ns 为/p 案例/n 的/uj 实证分析/n’
• After keyword labeling: ‘贸易投资/KW_B 一体化/KW_I 与/KW_S 就业增长/KW_N ——/KW_S 以/KW_S 江苏省/KW_N 为/KW_S 案例/KW_N 的/KW_S 实证分析/KW_Y’
• Labels:
  – KW_B: current word is at the beginning of a keyword
  – KW_I: current word is part of a keyword, but not at its beginning
  – KW_S: current word is a word in the stop list
  – KW_N: current word is neither a keyword nor a word in the stop list
  – KW_Y: current word is a keyword
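The labeling scheme above can be turned back into keywords by joining each KW_B word with the KW_I words that follow it and by taking each KW_Y word on its own. A minimal sketch, assuming the `word/LABEL` text format shown on the slide and assuming that multi-word keywords are joined without spaces (reasonable for Chinese); the function names `parse_tagged` and `extract_keywords` are illustrative.

```python
from typing import List, Tuple


def parse_tagged(line: str) -> List[Tuple[str, str]]:
    """Split 'word/LABEL word/LABEL ...' into (word, label) pairs."""
    pairs = []
    for token in line.split():
        word, _, label = token.rpartition("/")
        pairs.append((word, label))
    return pairs


def extract_keywords(pairs: List[Tuple[str, str]]) -> List[str]:
    """Rebuild keywords from KW_B / KW_I / KW_Y labels; other labels are skipped."""
    keywords, current = [], ""
    for word, label in pairs:
        if label == "KW_B":
            if current:                 # close a keyword that was still open
                keywords.append(current)
            current = word
        elif label == "KW_I" and current:
            current += word             # continue the open multi-word keyword
        else:
            if current:
                keywords.append(current)
                current = ""
            if label == "KW_Y":
                keywords.append(word)   # single-word keyword
    if current:
        keywords.append(current)
    return keywords


labeled = ("贸易投资/KW_B 一体化/KW_I 与/KW_S 就业增长/KW_N ——/KW_S 以/KW_S "
           "江苏省/KW_N 为/KW_S 案例/KW_N 的/KW_S 实证分析/KW_Y")
print(extract_keywords(parse_tagged(labeled)))
# prints ['贸易投资一体化', '实证分析']
```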
Features in the CRF Model (Cont’d)
Training Data for CRF Model
Process of the CRF-based AKE
(1) Preprocessing and feature extraction
(2) CRF model training
(3) CRF labeling and keyword extraction
(4) Results evaluation
Experimental Results--Data Sets
• Documents from http://art.zlzx.org
• 600 academic documents chosen at random
• The number of annotated keywords per document ranges from 5 to 10, with an average of 7.83
Experimental Results--Evaluation Measures (3)
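A minimal sketch of keyword-level scoring, assuming the measures on this slide are the usual precision, recall, and F1 computed by exact match between extracted and annotated keywords. The slide text itself does not spell them out, and the function name `prf` is illustrative.

```python
from typing import Iterable, Tuple


def prf(extracted: Iterable[str], annotated: Iterable[str]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for one document, by exact keyword match (assumed)."""
    ext, ann = set(extracted), set(annotated)
    correct = len(ext & ann)
    precision = correct / len(ext) if ext else 0.0
    recall = correct / len(ann) if ann else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

For example, extracting four keywords of which two appear among three annotated keywords gives a precision of 1/2, a recall of 2/3, and an F1 of 4/7.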
Experimental Results--Other AKE Approaches
• SVM
• Multiple linear regression (MLR)
• Logistic regression (Logit)
• BaseLine1: TF*IDF * Len
• BaseLine2: TF*IDF * Len * DEP
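The first baseline can be sketched as follows. The slides do not define the terms, so this sketch assumes TF is the relative frequency of a word in the document, IDF is log(N / document frequency), and Len is the length of the word in characters. DEP is left out because its definition is not given here. The function name `tfidf_len_scores` is illustrative.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_len_scores(doc: List[str], corpus: List[List[str]]) -> Dict[str, float]:
    """Score each word in `doc` by TF * IDF * Len (all three definitions are assumptions)."""
    n_docs = len(corpus)
    df = Counter()
    for d in corpus:
        df.update(set(d))  # count each word once per document

    tf = Counter(doc)
    scores = {}
    for word, count in tf.items():
        # Words unseen in the corpus get IDF 0 in this sketch.
        idf = math.log(n_docs / df[word]) if df[word] else 0.0
        scores[word] = (count / len(doc)) * idf * len(word)
    return scores


corpus = [
    ["crf", "model", "keyword"],
    ["keyword", "extraction"],
    ["crf", "keyword", "keyword"],
]
scores = tfidf_len_scores(corpus[0], corpus)
top = sorted(scores, key=scores.get, reverse=True)
```

Candidates would then be ranked by score and the top-ranked ones taken as keywords; a word that occurs in every document gets a score of zero under this IDF definition.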
Experimental Results and Discussions • Performance Evaluation of Six Models
Experimental Results and Discussions (Cont’d) • Lexicon Dictionary Size Influence on AKE
Experimental Results and Discussions (Cont’d) • Training Set Size Influence on AKE
Error Analysis
• (1) Errors in the training set
  – ‘牧民 (herdsman)’ and ‘牧户 (herder household)’
• (2) Ambiguity of the extracted keywords
Conclusion and Future Work
• The CRF model is well suited to the task of AKE
• Future work: apply the AKE approach to Web pages, email, and other non-academic documents
Thank you!
E-mail: zhangchz@istic.ac.cn