Introduction to Information Retrieval Homework 1 TFIDF wyangntu

Introduction to Information Retrieval
Homework 1: TF-IDF (Homework Discussion)
Prof. 楊立偉, wyang@ntu.edu.tw
© Copyright 2016

Demonstration
• Check the example Excel file

Discussion (1)
• Why can the n-gram approach with tf-idf extract Chinese keywords?
  – What if only tf is used?
  – What if only df is used?
  – What if only idf is used?

• tf can extract candidate terms, but frequent candidates may simply be common terms
• idf can filter out the common terms
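The interplay of tf and idf on character n-grams can be sketched as follows. This is a minimal illustration, not the method used in the example Excel file: the helper names `char_ngrams` and `tfidf_keywords`, the n-gram range of 2 to 4, the natural-log idf, and the three toy documents are all assumptions made for the example.

```python
import math
from collections import Counter

def char_ngrams(text, n_min=2, n_max=4):
    """Return every character n-gram of length n_min..n_max in text."""
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]

def tfidf_keywords(docs, target_index, top_k=5):
    """Rank the n-grams of docs[target_index] by tf * idf (hypothetical helper)."""
    grams_per_doc = [char_ngrams(d) for d in docs]
    # df: number of documents that contain the n-gram at least once
    df = Counter(g for grams in grams_per_doc for g in set(grams))
    # tf: raw count of the n-gram inside the target document
    tf = Counter(grams_per_doc[target_index])
    n_docs = len(docs)
    # idf = log(N / df); an n-gram found in every document scores 0
    scores = {g: tf[g] * math.log(n_docs / df[g]) for g in tf}
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]

# Toy documents, invented for illustration only
docs = ["健保制度改革健保費率", "政府公布新的政策", "政府討論教育政策"]
print(tfidf_keywords(docs, 0))
```

In this toy corpus the repeated n-gram 健保 ranks first for the first document because its tf is highest, while any n-gram that occurs in every document gets an idf of 0 and drops out regardless of its tf.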

• Exercise
  – Longer terms are more informative
  – Adjust the tf-idf weighting to favor longer terms
• Apply the following to the 保健 (health care) and 社會 (society) topics:
  tf-idf × len(term)^n, n ≥ 1
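The length adjustment in the exercise can be tried out with a few lines of Python. This is only a sketch: the function name `length_weighted` and the three tf-idf scores are hypothetical values chosen to show the effect, not results from the homework data.

```python
def length_weighted(score, term, n=1):
    """Scale a tf-idf score by len(term) ** n, with n >= 1 as in the exercise."""
    if n < 1:
        raise ValueError("n must be >= 1")
    return score * len(term) ** n

# Hypothetical tf-idf scores, for illustration only
tfidf = {"健保": 2.0, "健保費": 1.5, "健保費率": 1.2}

for n in (1, 2):
    # Sort terms by the adjusted score, highest first
    ranked = sorted(tfidf, key=lambda t: -length_weighted(tfidf[t], t, n))
    print(n, ranked)
```

With these numbers the raw tf-idf order puts the 2-character term first, whereas the adjusted score moves the 4-character term to the top; a larger n strengthens that preference.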

Discussion (2)
• Use SQL for topic pre-selection
  – e.g. SELECT * FROM corpus WHERE topic LIKE '%政治%'
  – What if the whole corpus is used?
• What is the relationship between the topic and the keywords?
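Topic pre-selection with the query above can be sketched using Python's built-in `sqlite3` module. The `corpus` table and `topic` column come from the slide; the `id` and `content` columns, the helper `select_topic`, and the sample rows are assumptions made so the example can run on its own.

```python
import sqlite3

# In-memory database standing in for the real corpus (assumed schema)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE corpus (id INTEGER PRIMARY KEY, topic TEXT, content TEXT)"
)
conn.executemany(
    "INSERT INTO corpus (topic, content) VALUES (?, ?)",
    [
        ("政治", "政府公布新的政策"),
        ("國際政治", "各國討論貿易協定"),
        ("保健", "健保制度改革"),
    ],
)

def select_topic(conn, topic):
    """Return the content of every row whose topic contains the given string."""
    rows = conn.execute(
        "SELECT content FROM corpus WHERE topic LIKE ? ORDER BY id",
        (f"%{topic}%",),  # parameter binding instead of string concatenation
    ).fetchall()
    return [row[0] for row in rows]

print(select_topic(conn, "政治"))
```

Because of the `%` wildcards, the pattern matches any topic label that contains 政治, so a label such as 國際政治 is selected as well.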

Topic-related Keywords
• In addition to term weighting, we need to consider the relevance between the terms and the topic
• Use Mutual Information: tf-idf × MI

• Mutual Information
  – P: probability
  – N: size of the corpus
  – f(x): the number of occurrences of term x in the corpus
  – f(y): the number of occurrences of term y in the corpus
  – f(x, y): the number of co-occurrences of terms x and y in the corpus
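With these symbols, the usual pointwise form of mutual information can be written as below. This is offered as an assumption about the intended formula: probabilities are estimated as relative frequencies over the corpus, and the base-2 logarithm is a conventional choice rather than something stated here.

```latex
\[
MI(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}
         = \log_2 \frac{f(x, y)/N}{\bigl(f(x)/N\bigr)\,\bigl(f(y)/N\bigr)}
         = \log_2 \frac{N \cdot f(x, y)}{f(x)\,f(y)}
\]
```

Under this form, MI is 0 when x and y co-occur exactly as often as independence would predict, positive when they co-occur more often, and negative when they co-occur less often.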

• Mutual Information
  – A larger MI means a stronger tendency for terms x and y to co-occur (the larger the value, the higher the co-occurrence rate)
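A small numeric check makes the "larger MI, stronger co-occurrence" reading concrete. The sketch assumes the common pointwise definition MI(x, y) = log2(N · f(x, y) / (f(x) · f(y))); the function name `mutual_information` and the counts are hypothetical.

```python
import math

def mutual_information(f_xy, f_x, f_y, n):
    """Pointwise MI from raw counts: log2(N * f(x, y) / (f(x) * f(y)))."""
    if f_xy == 0:
        return float("-inf")  # x and y never co-occur
    return math.log2(n * f_xy / (f_x * f_y))

# Hypothetical counts: N = 1000, f(x) = 50, f(y) = 20
for f_xy in (1, 2, 10):
    print(f_xy, mutual_information(f_xy, 50, 20, 1000))
```

With f(x) and f(y) held fixed, one co-occurrence is exactly what independence predicts here (MI = 0), and each additional co-occurrence raises the value.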

• Using tf-idf × MI may extract topic-related keywords more precisely
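One way to combine the two weights is sketched below. Several choices here are assumptions rather than part of the slides: MI is computed between a term and a topic label using document-level counts, negative MI is clamped to 0 so that it cannot flip the sign of the product, and the names `topic_mi` and `topic_score`, the tf-idf values, and the four labelled documents are invented for illustration.

```python
import math

def topic_mi(term, topic, docs):
    """MI between a term and a topic label, from document-level counts.

    docs is a list of (topic, text) pairs; N is the number of documents.
    """
    n = len(docs)
    f_x = sum(term in text for _, text in docs)        # docs containing the term
    f_y = sum(t == topic for t, _ in docs)             # docs labelled with the topic
    f_xy = sum(t == topic and term in text for t, text in docs)
    if f_xy == 0:
        return float("-inf")  # term never appears under this topic
    return math.log2(n * f_xy / (f_x * f_y))

def topic_score(tfidf, mi):
    """tf-idf * MI, with negative MI clamped to 0 (an assumption)."""
    return tfidf * max(mi, 0.0)

# Toy labelled corpus and hypothetical tf-idf scores
docs = [
    ("政治", "政府公布新的政策"),
    ("政治", "政府討論教育政策"),
    ("保健", "健保制度改革"),
    ("保健", "政府調整健保費率"),
]
tfidf = {"健保": 1.4, "政府": 1.4}

for term, score in tfidf.items():
    print(term, topic_score(score, topic_mi(term, "保健", docs)))
```

Both terms start with the same tf-idf score, but only 健保 is concentrated in the 保健 topic, so it keeps its weight while 政府, which is spread across topics, is pushed down to 0.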