Introduction to Information Retrieval Homework 1: TF-IDF

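Introduction to Information Retrieval
Homework 1: TF-IDF
Prof. 楊立偉 (Willie Yang), Department of Information Management, National Taiwan University of Science and Technology (台灣科大)
wyang@ntu.edu.tw
© Copyright 2016 by Willie Yang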

Chinese Keyword Extraction
• Chinese keyword extraction is fundamental to many applications.
• There are two major approaches:
  – Word segmentation is required first
  – No word segmentation is required

N-gram approach
• No word segmentation is required
• The keywords are a subset of the n-grams
• How do we select the proper n-grams as keywords?
  – tf-idf
  – chi-square
  – mutual information
  – information gain, maximum entropy, and others
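As a rough illustration of where the candidate set comes from, the sketch below enumerates every contiguous character n-gram of a string. The function name char_ngrams and the length bounds used in the example are assumptions made for illustration, not part of the slides.

```python
def char_ngrams(text, n_min, n_max):
    """Yield every contiguous character n-gram with n_min <= length <= n_max."""
    for n in range(n_min, n_max + 1):
        # A string of length L has L - n + 1 n-grams (none if n > L).
        for i in range(len(text) - n + 1):
            yield text[i:i + n]


# Candidate keywords for a short string, using 2-grams and 3-grams only.
candidates = list(char_ngrams("林書豪得分", 2, 3))
print(candidates)
```

Every true keyword of length 2 or 3 in the string is somewhere in this list; the selection measures above decide which candidates to keep.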

N-gram approach with tf-idf
• Enumerate n-grams, for example, of lengths 2 to 6
• Compute tf and idf
• Sort by tf-idf in descending order
• Remove non-keywords
  – Ex. n-grams containing digits or special characters are not counted
• Remove sub-keywords
  – Ex. remove 林書 and 書豪; keep only 林書豪
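The steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the course's reference solution: the slides do not fix a weighting formula, so it assumes tf is the raw count in the document and idf is log(N / df); "special characters" is taken to mean anything str.isalpha() rejects; and a sub-keyword is taken to be any top-ranked n-gram that is a proper substring of another top-ranked n-gram. All function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=2, n_max=6):
    """Yield every contiguous character n-gram with n_min <= length <= n_max."""
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            yield text[i:i + n]


def is_candidate(gram):
    """Reject n-grams containing digits, punctuation, or whitespace.

    Assumption: "special character" means anything str.isalpha() rejects.
    """
    return all(ch.isalpha() for ch in gram)


def tfidf_scores(corpus, doc_index, n_min=2, n_max=6):
    """Score the candidate n-grams of corpus[doc_index].

    Assumed weighting: tf = raw count in the document,
    idf = log(N / df), with df counted by substring match.
    """
    doc = corpus[doc_index]
    tf = Counter(g for g in char_ngrams(doc, n_min, n_max) if is_candidate(g))
    scores = {}
    for gram, count in tf.items():
        # df >= 1 because the n-gram occurs in corpus[doc_index] itself.
        df = sum(1 for other in corpus if gram in other)
        scores[gram] = count * math.log(len(corpus) / df)
    return scores


def remove_sub_keywords(ranked):
    """Drop any entry that is a proper substring of another entry."""
    return [g for g in ranked
            if not any(g != other and g in other for other in ranked)]


def extract_keywords(corpus, doc_index, top_k=10):
    scores = tfidf_scores(corpus, doc_index)
    # Sort by score, descending; ties go to the longer n-gram (an assumption).
    ranked = sorted(scores, key=lambda g: (-scores[g], -len(g), g))
    return remove_sub_keywords(ranked[:top_k])


# Toy corpus of three "documents"; real input would be news articles.
corpus = ["林書豪得分林書豪助攻", "股市今日上漲", "颱風明日登陸"]
print(extract_keywords(corpus, 0, top_k=3))  # ['林書豪']
```

On a corpus this small the result is sensitive to top_k: a larger cutoff lets long one-off n-grams into the list, and they then absorb 林書豪 as a sub-keyword. A larger corpus should give more stable frequencies, and a stricter sub-keyword rule (for example, comparing scores before dropping) may be worth trying.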

Demonstration
• Use a news corpus
• Use different topics

Discussion
"Basic algorithms and rich corpus can do a great job."
Use the keywords to tag every original document (as document features that represent the document).
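One simple way to tag a document with extracted keywords is sketched below. The function name tag_document and the choice of occurrence counts as the feature value are assumptions for illustration; the slides do not specify a feature format.

```python
def tag_document(text, keywords):
    """Map each keyword that occurs in text to its occurrence count."""
    # str.count counts non-overlapping occurrences of the keyword.
    return {kw: text.count(kw) for kw in keywords if kw in text}


features = tag_document("林書豪得分林書豪助攻", ["林書豪", "颱風"])
print(features)  # {'林書豪': 2}
```

The resulting dictionary can serve as a sparse feature vector for the document, with absent keywords implicitly zero.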