A two-stage feature selection method for text categorization

















- Slides: 17
A two-stage feature selection method for text categorization by using information gain, principal component analysis and genetic algorithm
Presenter: YU-TING LU
Author: Harun Uğuz
2011, KBS
Intelligent Database Systems Lab
Outline
• Motivation
• Objectives
• Methodology
• Experiments
• Conclusions
• Comments
Motivation
• A major problem in text categorization is the large number of features.
• Most of these features are irrelevant noise that can mislead the classifier.
Objectives
• Use two-stage feature selection and feature extraction to improve the performance of text categorization.
Methodology
Methodology – pre-processing
– Removal of stop-words: a, and, because, can, do, every, the, …
– Stemming: computer, computing, computation, computes → comput
– Term weighting
– Pruning of words: prune the words that appear fewer than two times in the documents.
[Figure: document-term matrix, with axes labeled "documents" and "terms of the document collection"]
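The pre-processing steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop-word list is the short sample from the slide, stemming is omitted, "appears fewer than two times" is read as the total count over the whole collection, and tf-idf is assumed as the weighting scheme (the slide does not name one). The function names `tokenize` and `build_matrix` are hypothetical.

```python
import math
import re
from collections import Counter

# Assumed, abbreviated stop-word list (only the examples shown on the slide).
STOP_WORDS = {"a", "and", "because", "can", "do", "every", "the"}

def tokenize(text):
    """Lowercase, keep alphabetic tokens, and drop stop-words (no stemming here)."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def build_matrix(docs, min_count=2):
    """Return (vocabulary, weighted rows) after pruning rare terms."""
    tokenized = [tokenize(d) for d in docs]
    totals = Counter(w for toks in tokenized for w in toks)
    # Pruning: keep terms that appear at least min_count times in the collection.
    vocab = sorted(w for w, c in totals.items() if c >= min_count)
    n_docs = len(docs)
    doc_freq = Counter(w for toks in tokenized for w in set(toks))
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        # Assumed weighting: term frequency times log inverse document frequency.
        rows.append([tf[w] * math.log(n_docs / doc_freq[w]) for w in vocab])
    return vocab, rows
```

Each row of the result corresponds to one document and each column to one surviving term, which matches the layout of a document-term matrix.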
Methodology – feature ranking with information gain
• Each term in the text is ranked by its importance for classification, in decreasing order, using the information gain (IG) method.
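A minimal sketch of IG-based ranking, assuming the common definition of information gain as the class entropy minus the expected class entropy after splitting documents on presence or absence of a term. Documents are represented as sets of tokens; the function names are hypothetical and not taken from the paper.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(docs, labels, term):
    """H(C) minus the weighted entropy of the classes given term presence/absence."""
    with_t = [y for d, y in zip(docs, labels) if term in d]
    without_t = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    cond = (len(with_t) / n) * entropy(with_t) + (len(without_t) / n) * entropy(without_t)
    return entropy(labels) - cond

def rank_terms(docs, labels):
    """Return all terms sorted by decreasing IG (ties broken alphabetically)."""
    vocab = set().union(*docs)
    return sorted(vocab, key=lambda t: (-information_gain(docs, labels, t), t))
```

In a two-stage setup, only the top-ranked terms from `rank_terms` would be passed on to the second stage.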
Methodology – dimension reduction methods
• Principal component analysis (PCA): p ≤ m
• Genetic algorithm (GA) for feature selection
  – Individual's encoding: 11011, 00110, 01110, 11110
  – Fitness function
  – Selection
  – Crossover
  – Mutation
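The GA components listed above (bit-string encoding, fitness function, selection, crossover, mutation) can be sketched as follows. This is a generic illustration, not the configuration used in the paper: tournament selection, one-point crossover, bit-flip mutation, the parameter values, and the `toy_fitness` function with its `USEFUL` set are all assumptions chosen for brevity.

```python
import random

def tournament(pop, fitness, rng):
    """Selection: pick the fitter of two randomly drawn individuals (assumed scheme)."""
    a, b = rng.sample(pop, 2)
    return a if fitness(a) >= fitness(b) else b

def genetic_feature_selection(n_features, fitness, pop_size=20, generations=30,
                              crossover_rate=0.8, mutation_rate=0.02, seed=0):
    """Evolve bit-strings in which bit i == 1 means feature i is kept."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        children = []
        while len(children) < pop_size:
            p1 = tournament(pop, fitness, rng)
            p2 = tournament(pop, fitness, rng)
            if n_features > 1 and rng.random() < crossover_rate:
                cut = rng.randint(1, n_features - 1)  # one-point crossover
                child = p1[:cut] + p2[cut:]
            else:
                child = list(p1)
            # Mutation: flip each bit with a small probability.
            child = [bit ^ 1 if rng.random() < mutation_rate else bit for bit in child]
            children.append(child)
        pop = children
        best = max(pop + [best], key=fitness)  # remember the best individual seen
    return best

# Hypothetical fitness: reward features in a known "useful" set, penalize subset size.
USEFUL = {0, 2, 3}

def toy_fitness(bits):
    hits = sum(1 for i, b in enumerate(bits) if b and i in USEFUL)
    return hits - 0.1 * sum(bits)
```

In a real text categorization setting the fitness of an individual would typically be derived from classifier performance on the selected features; `toy_fitness` only stands in for that.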
Methodology – text categorization methods
• KNN classifier
• C4.5 decision tree classifier
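As an illustration of the first classifier, here is a small k-nearest-neighbour sketch over term-weight vectors. Cosine similarity and majority voting are assumptions; the slide does not state which similarity measure or value of k was used. C4.5 is not sketched here.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def knn_predict(train_x, train_y, query, k=3):
    """Label the query by majority vote among its k most similar training vectors."""
    ranked = sorted(zip(train_x, train_y),
                    key=lambda pair: cosine(pair[0], query), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```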
Methodology – evaluation of the performance
• Precision
• Recall
• F-measure
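The three measures can be computed per category from true and predicted labels. The sketch below assumes the standard definitions (precision = TP / (TP + FP), recall = TP / (TP + FN)) and takes the F-measure to be the balanced harmonic mean of the two; the slide's own formulas were not preserved, so this is an assumption.

```python
def precision_recall_f(y_true, y_pred, positive):
    """Precision, recall and balanced F-measure for one category."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure
```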
Experiments – datasets
– Reuters-21578 dataset
  Category name    Number of documents
  Earn             3743
  Acquisition      2179
  Money-fx         633
  Crude            561
  Grain            542
  Trade            500
– Classic 3 dataset
  Category name    Number of documents
  CRANFIELD        1398
  MEDLINE          1033
  CISI             1460
Experiments – Reuters-21578
• After pre-processing, a document-term matrix of dimension 8158 × 7542 is obtained.
Experiments – Reuters-21578
Experiments – Classic 3
• After pre-processing, a document-term matrix of dimension 3891 × 6679 is obtained.
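A quick arithmetic check, using the category counts from the datasets slide, shows that the first dimension of each matrix equals the total number of documents, so rows are documents and columns are terms. The variable names are illustrative only.

```python
# Category sizes as listed on the datasets slide.
reuters = {"Earn": 3743, "Acquisition": 2179, "Money-fx": 633,
           "Crude": 561, "Grain": 542, "Trade": 500}
classic3 = {"CRANFIELD": 1398, "MEDLINE": 1033, "CISI": 1460}

reuters_docs = sum(reuters.values())    # rows of the 8158 x 7542 matrix
classic3_docs = sum(classic3.values())  # rows of the 3891 x 6679 matrix
```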
Experiments – Classic 3
Conclusions
• Text categorization with the C4.5 decision tree and KNN algorithms is more successful when using the smaller feature sets selected via IG-PCA and IG-GA than when using features selected via IG alone.
• Two-stage feature selection methods can improve the performance of text categorization.
Comments
• Advantages
  – understand the basic methods
• Applications
  – text categorization