Introduction to Information Retrieval
Chinese Keyword Extraction
• Chinese keyword extraction is fundamental to many applications.
• There are two major approaches:
  – With word segmentation first
  – Without word segmentation

N-gram approach
• No word segmentation is required.
• The keywords are a subset of the n-grams.
• How do we select the proper n-grams as keywords?
  – tf-idf
  – chi-square
  – mutual information
  – information gain, maximum entropy, and others

N-gram approach with tf-idf
• Enumerate n-grams, for example for n = 2 to 6
• Compute tf and idf
• Sort by tf-idf in descending order
• Remove non-keywords
  – Ex. n-grams containing digits or special characters are not counted
• Remove sub-keywords
  – Ex. remove 林書 and 書豪, and keep only 林書豪
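The steps on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the lecture's own implementation: the function names `ngrams` and `extract_keywords` are hypothetical, the weighting is assumed to be raw tf times log(N / df), the non-keyword filter is approximated with `str.isalpha()`, and a sub-keyword is assumed to be any candidate contained in a keyword that was already kept.

```python
import math
from collections import Counter


def ngrams(text, n_min=2, n_max=6):
    """Yield every character n-gram of length n_min..n_max in text."""
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            yield text[i:i + n]


def extract_keywords(docs, doc_index, top_k=10, n_min=2, n_max=6):
    """Rank the n-grams of docs[doc_index] by tf-idf and prune them."""
    # Non-keyword filter (assumption): keep only n-grams made entirely of
    # letters/ideographs, which rejects digits, punctuation, and spaces.
    tf = Counter(g for g in ngrams(docs[doc_index], n_min, n_max) if g.isalpha())

    # Document frequency: number of documents containing each n-gram.
    df = Counter()
    for doc in docs:
        df.update(set(ngrams(doc, n_min, n_max)))

    # Assumed weighting: raw tf times log(N / df).
    scores = {g: tf[g] * math.log(len(docs) / df[g]) for g in tf}

    # Sort by score descending; ties go to the longer n-gram, then by text.
    ranked = sorted(scores, key=lambda g: (-scores[g], -len(g), g))

    # Sub-keyword removal (assumption): skip a candidate that is contained
    # in a keyword already kept.
    kept = []
    for g in ranked:
        if len(kept) == top_k:
            break
        if not any(g in k for k in kept):
            kept.append(g)
    return kept


docs = ["林書豪得分林書豪助攻林書豪", "今天天氣很好", "股市今天上漲"]
print(extract_keywords(docs, 0, top_k=1))  # ['林書豪']
```

Breaking ties in favor of the longer n-gram is what lets 林書豪 be kept before 林書 and 書豪 are considered, so the two shorter strings are dropped as sub-keywords. A real system would likely need a stricter rule, for example comparing frequencies before discarding a substring.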

Demonstration
• Use a news corpus
• Use different topics

Discussion
"Basic algorithms and rich corpus can do a great job."
Use the keywords to tag every original document (as document features that represent the document)
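One simple way to read the tagging idea is plain substring matching. The sketch below assumes a keyword tags a document whenever it occurs in the text, and that the document feature is a binary vector over the keyword list. Both helper names, `tag_documents` and `keyword_features`, are hypothetical.

```python
def tag_documents(docs, keywords):
    """For each document, list the keywords that occur in it as substrings."""
    return [[kw for kw in keywords if kw in doc] for doc in docs]


def keyword_features(doc, keywords):
    """Binary feature vector: 1 if the keyword occurs in doc, else 0."""
    return [1 if kw in doc else 0 for kw in keywords]


keywords = ["林書豪", "股市"]
docs = ["林書豪得分", "股市上漲"]
print(tag_documents(docs, keywords))
print(keyword_features(docs[0], keywords))
```

Weighted features, such as the tf-idf score of each keyword, are an equally plausible reading of the slide.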