Introduction to Information Retrieval
Demonstration
• Check the example Excel file
Discussion (1)
• Why can the n-gram approach with tf-idf extract Chinese keywords?
  – What if only tf is used?
  – What if only df is used?
  – What if only idf is used?
• tf extracts candidate terms, but some of them may be common terms
• idf filters out the common terms
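The interplay of tf and idf can be sketched on character n-grams as follows. This is a minimal illustration, not the course's implementation: the three-document corpus is made up, the helper names `char_ngrams` and `tf_idf` are hypothetical, and idf is assumed to be log(N / df), which is only one of several common variants.

```python
import math
from collections import Counter

def char_ngrams(text, n_min=2, n_max=3):
    """Yield every character n-gram of length n_min..n_max."""
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            yield text[i:i + n]

def tf_idf(docs, n_min=2, n_max=3):
    """Return one {term: tf * idf} dict per document (idf = log(N / df))."""
    tfs = [Counter(char_ngrams(d, n_min, n_max)) for d in docs]
    df = Counter()
    for tf in tfs:
        df.update(tf.keys())  # count each term once per document
    n_docs = len(docs)
    return [
        {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}
        for tf in tfs
    ]

# Made-up toy corpus (assumption, for illustration only)
docs = ["健康保險制度", "社會保險制度", "保險制度改革"]
scores = tf_idf(docs)
```

In this toy corpus 保險 occurs in all three documents, so its idf is log(3/3) = 0 and it is filtered out, while 健康 occurs only in the first document and keeps a positive weight.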
• Exercise
  – longer terms are more informative
  – adjust the tf-idf weighting to favor longer terms
• Apply the following to the 保健 (health care) and 社會 (society) topics:
  tf-idf * len(term)^n, n ≥ 1
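One way to try the exercise is sketched below. The function name `length_weighted` and the two input scores are hypothetical; only the formula tf-idf * len(term)^n with n ≥ 1 comes from the slide.

```python
def length_weighted(scores, n=1):
    """Rescale each tf-idf score by len(term) ** n, with n >= 1."""
    if n < 1:
        raise ValueError("n must be >= 1")
    return {term: score * len(term) ** n for term, score in scores.items()}

# Hypothetical tf-idf scores (assumption, for illustration only)
scores = {"保險": 0.5, "健康保險": 0.4}
weighted = length_weighted(scores, n=2)

# Rank terms from highest to lowest adjusted weight
ranked = sorted(weighted, key=weighted.get, reverse=True)
```

With n = 2 the four-character term overtakes the two-character one even though its raw tf-idf is lower, which is the effect the exercise asks for.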
Discussion (2)
• Use SQL for topic pre-selection
  – e.g. SELECT * FROM corpus WHERE topic LIKE '%政治%' (政治: politics)
  – What if the whole corpus is used?
• What is the relationship between the topic and its keywords?
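The pre-selection step can be tried with Python's built-in sqlite3 module. This is a sketch under assumptions: the table name `corpus` and the column `topic` come from the slide's query, while the `text` column and the three rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE corpus (topic TEXT, text TEXT)")

# Made-up rows (assumption, for illustration only)
conn.executemany(
    "INSERT INTO corpus (topic, text) VALUES (?, ?)",
    [
        ("政治", "選舉結果公布"),
        ("國際政治", "兩國領袖會談"),
        ("保健", "流感疫苗接種"),
    ],
)

# Keep only documents whose topic label contains 政治
selected = conn.execute(
    "SELECT topic, text FROM corpus WHERE topic LIKE ?", ("%政治%",)
).fetchall()
conn.close()
```

Only the two rows whose topic contains 政治 are returned, so the later tf-idf statistics are computed on the topic's documents rather than on the whole corpus.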
Topic-related Keywords
• In addition to term weighting, we need to consider the relevance between terms and the topic
• Use Mutual Information (MI): tf-idf * MI
• Mutual Information
  P: probability
  N: size of the corpus
  f(x): the number of occurrences of term x in the corpus
  f(y): the number of occurrences of term y in the corpus
  f(x, y): the number of co-occurrences of terms x and y in the corpus
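The formula itself is not preserved in this text. The symbols defined here match the standard pointwise form of mutual information, so the following is given as an assumption rather than as the slide's exact equation, with probabilities estimated as P(x) ≈ f(x)/N, P(y) ≈ f(y)/N and P(x, y) ≈ f(x, y)/N:

```latex
\[
MI(x, y) \;=\; \log_2 \frac{P(x, y)}{P(x)\,P(y)}
         \;\approx\; \log_2 \frac{N \cdot f(x, y)}{f(x)\, f(y)}
\]
```

Under this form MI is 0 when x and y co-occur exactly as often as independence would predict, and positive when they co-occur more often.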
• Mutual Information
  – a larger MI value indicates a stronger tendency for terms x and y to co-occur
• Using tf-idf * MI may extract topic-related keywords more precisely
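A minimal sketch of the combined weighting follows. It assumes the standard pointwise form MI = log2(N * f(x, y) / (f(x) * f(y))), treats y as the topic, and clamps negative MI to zero so that negatively associated terms do not receive a negative weight; that clamping, the function names, and all counts are assumptions made for illustration.

```python
import math

def mutual_information(f_x, f_y, f_xy, n):
    """Pointwise MI from raw counts; -inf when x and y never co-occur."""
    if f_xy == 0:
        return float("-inf")
    return math.log2(n * f_xy / (f_x * f_y))

def topic_score(tfidf, f_x, f_y, f_xy, n):
    """tf-idf * MI, with negative MI clamped to 0 (an assumption)."""
    mi = mutual_information(f_x, f_y, f_xy, n)
    return tfidf * max(mi, 0.0)

# Hypothetical counts: corpus of 1000 units, topic y occurs in 100 of them
N, F_TOPIC = 1000, 100

# Term A co-occurs with the topic far more often than chance
score_a = topic_score(0.5, f_x=50, f_y=F_TOPIC, f_xy=40, n=N)

# Term B has the same tf-idf but co-occurs with the topic only at chance level
score_b = topic_score(0.5, f_x=500, f_y=F_TOPIC, f_xy=50, n=N)
```

Both terms start with the same tf-idf, but only the term that co-occurs with the topic more often than chance keeps a nonzero score, which is how the MI factor narrows the keyword list to topic-related terms.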