Project 1: Text Classification by Neural Networks, Ver 1.1

Outline

  • Classification using an artificial neural network (ANN)
    ◦ Learn and classify text documents
    ◦ Estimate several statistics on the dataset

(C) 2006, SNU Biointelligence Laboratory

Network Structure

[Figure: network diagram mapping the input to three output units: Class 1, Class 2, and Class 3]
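As a rough illustration of such a network, the sketch below runs one forward pass through a network with a single hidden layer and three output units. The sigmoid hidden units, the softmax output, the weight range, the 10 hidden units, and all function names are assumptions made for this example; the slides do not specify them.

```python
import math
import random

def init_layer(n_in, n_out, rng):
    # One row per unit: n_in small random weights followed by a bias term.
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in + 1)] for _ in range(n_out)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, hidden_w, output_w):
    # Hidden layer: weighted sum plus bias, squashed by a sigmoid.
    h = [sigmoid(sum(w * v for w, v in zip(row[:-1], x)) + row[-1]) for row in hidden_w]
    # Output layer: one score per class, converted to probabilities with softmax.
    scores = [sum(w * v for w, v in zip(row[:-1], h)) + row[-1] for row in output_w]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
n_inputs, n_hidden, n_classes = 100, 10, 3  # 10 hidden units is an arbitrary choice
hidden_w = init_layer(n_inputs, n_hidden, rng)
output_w = init_layer(n_hidden, n_classes, rng)

x = [rng.randint(0, 3) for _ in range(n_inputs)]  # made-up term-frequency vector
probs = forward(x, hidden_w, output_w)
predicted_class = probs.index(max(probs)) + 1  # classes numbered 1..3 as in the diagram
```

The weights here are untrained, so the prediction is meaningless; the point is only the shape of the computation from input vector to one output per class.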

CLASSIC3 Dataset

CLASSIC3

  • Three categories: 3,891 documents
    ◦ CISI: 1,460 document abstracts on information retrieval from the Institute of Scientific Information.
    ◦ CRAN: 1,398 document abstracts on aeronautics from the Cranfield Institute of Technology.
    ◦ MED: 1,033 biomedical abstracts from MEDLINE.

Text Representation in Vector Space

[Figure: a document collection passes through stemming, stop-word elimination, feature selection, ... to produce the VSM (bag-of-words) representation. Each document d1, d2, d3, ..., dn becomes a term vector of counts over terms such as baseball, specs, graphics, hockey, unix, and space. Together the vectors form the term-document matrix, which is the dataset format.]
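A minimal sketch of the bag-of-words step, using the six example terms from the figure as the vocabulary. The three sample documents and the function name `to_term_vector` are made up for illustration, and stemming and stop-word elimination are skipped here; words outside the vocabulary are simply ignored.

```python
from collections import Counter

# Hypothetical toy documents; the real input would be the CLASSIC3 abstracts.
docs = [
    "unix graphics specs for unix",
    "hockey and baseball scores",
    "space graphics",
]
vocabulary = ["baseball", "specs", "graphics", "hockey", "unix", "space"]

def to_term_vector(text, vocabulary):
    # Count whitespace-separated tokens, then read off the count of each vocabulary term.
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

# One row per document, one column per term.
matrix = [to_term_vector(d, vocabulary) for d in docs]
```

Here each row is a document and each column a term; transposing gives the term-by-document orientation.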

Dimensionality Reduction

[Figure: term (or feature) vectors are scored with a scoring measure on individual features, sorted by score, and the terms with higher values are chosen. The documents in vector space are then term-weighted (TF or TF x IDF) and passed to the ML algorithm.]

  • TF: term frequency
  • IDF: inverse document frequency
  • N: number of documents
  • n_i: number of documents that contain the i-th word
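The weighting formula itself did not survive in the slide text. A common form, assumed here, multiplies the term frequency by log(N / n_i), with N and n_i as defined above. The sketch below applies that assumed form to a small made-up count matrix; `tf_idf` is an illustrative name.

```python
import math

def tf_idf(matrix):
    n_docs = len(matrix)
    n_terms = len(matrix[0])
    # n_i: number of documents in which term i occurs at least once.
    doc_freq = [sum(1 for row in matrix if row[i] > 0) for i in range(n_terms)]
    # Assumed IDF form: log(N / n_i). Terms that never occur get weight 0.
    idf = [math.log(n_docs / df) if df > 0 else 0.0 for df in doc_freq]
    return [[tf * w for tf, w in zip(row, idf)] for row in matrix]

# Made-up counts: 3 documents (rows) x 3 terms (columns).
counts = [
    [2, 0, 1],
    [0, 1, 1],
    [1, 0, 1],
]
weighted = tf_idf(counts)
```

Under this form, a term that appears in every document (the third column) gets weight zero, which is the intended effect of IDF: such a term cannot help distinguish documents.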

Construction of Document Vectors

  • Controlled vocabulary
    ◦ Stopwords are removed.
    ◦ Stemming is used.
    ◦ Words whose document frequency is less than 5 are removed. Term size: 3,850
  • A document is represented with a 3,850-dimensional vector whose elements are word frequencies.
    ◦ Words are sorted according to their information gain values. The top 100 terms are selected.
    ◦ Result: a 3,830 (examples) x 100 (terms) matrix
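The selection step can be sketched as follows, assuming information gain is computed on term presence versus absence against the class label. The slides do not say which variant was used, and the toy matrix, labels, and function names are invented for the example.

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(column, labels):
    # Split documents by whether the term occurs, then compare class entropies.
    present = [y for v, y in zip(column, labels) if v > 0]
    absent = [y for v, y in zip(column, labels) if not v > 0]
    remainder = sum(
        len(part) / len(labels) * entropy(part) for part in (present, absent) if part
    )
    return entropy(labels) - remainder

def select_top_terms(matrix, labels, k):
    n_terms = len(matrix[0])
    gains = [information_gain([row[i] for row in matrix], labels) for i in range(n_terms)]
    ranked = sorted(range(n_terms), key=lambda i: gains[i], reverse=True)
    keep = sorted(ranked[:k])  # keep the surviving columns in their original order
    return keep, [[row[i] for i in keep] for row in matrix]

# Made-up counts: 4 documents x 3 terms, two classes.
matrix = [
    [3, 0, 1],
    [2, 0, 0],
    [0, 4, 1],
    [0, 1, 0],
]
labels = ["CISI", "CISI", "MED", "MED"]
kept, reduced = select_top_terms(matrix, labels, k=2)
```

In this toy case the first two terms separate the classes perfectly (gain of 1 bit each) while the third carries no information, so it is the one dropped. With the project data, `k` would be 100.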

Experimental Results

Data Setting for the Experiments

  • By default, the training and test sets are given.
    ◦ Training: 2,683 examples
    ◦ Test: 1,147 examples
  • N-fold cross-validation (optional)
    ◦ The dataset is divided into N subsets.
    ◦ The holdout method is repeated N times.
      ▪ Each time, one of the N subsets is used as the test set and the other (N-1) subsets are put together to form a training set.
    ◦ The average performance across all N trials is computed.
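The N-fold procedure can be sketched in a few lines. The function names, the fixed shuffle seed, and the placeholder `dummy_evaluate` are assumptions for illustration; a real `evaluate` would train a network on the training indices and return its accuracy on the test indices.

```python
import random

def n_fold_splits(n_examples, n_folds, seed=0):
    # Shuffle the example indices once, then deal them into n_folds subsets.
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

def cross_validate(n_examples, n_folds, evaluate):
    # Average the score returned by evaluate(train, test) over all folds.
    scores = [evaluate(train, test) for train, test in n_fold_splits(n_examples, n_folds)]
    return sum(scores) / len(scores)

def dummy_evaluate(train, test):
    # Placeholder standing in for "train a network, return test accuracy".
    return len(test) / (len(train) + len(test))

mean_score = cross_validate(3830, 10, dummy_evaluate)
```

Every example lands in exactly one test fold, so each one is used for testing once and for training N-1 times.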

Number of Epochs

Number of Hidden Units

  • Number of hidden units
    ◦ Minimum 10 runs for each setting

Results table (cells left blank, to be filled in):
  Columns: # Hidden Units, Train, Test, Average, SD, Best, Worst
  Rows: Setting 1, Setting 2, Setting 3
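One way to fill in a row of such a table is to collect the accuracy of each run and summarize it. The sketch below assumes SD means the sample standard deviation, which the slides do not specify. The accuracies are made-up placeholders, not results.

```python
import statistics

def summarize_runs(accuracies):
    return {
        "average": statistics.mean(accuracies),
        "sd": statistics.stdev(accuracies),  # sample standard deviation (n - 1 denominator)
        "best": max(accuracies),
        "worst": min(accuracies),
    }

# Placeholder test accuracies from 10 hypothetical runs of one setting.
runs = [0.91, 0.93, 0.90, 0.94, 0.92, 0.92, 0.95, 0.89, 0.93, 0.91]
summary = summarize_runs(runs)
```

Repeated runs matter because random weight initialization makes each run's accuracy differ, so a single run says little about a setting.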

Other Methods/Parameters

  • Normalization method for input vectors
  • Class decision policy
  • Learning rates
  • …
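Two of these choices can be illustrated briefly. The sketch below shows one possible normalization (scaling each input vector to unit Euclidean length) and one possible decision policy (pick the output unit with the largest activation, optionally rejecting when it falls below a threshold). Both are examples of options to compare, not the methods prescribed by the project, and the names are hypothetical.

```python
import math

def l2_normalize(vector):
    # Scale the vector to unit Euclidean length; leave all-zero vectors unchanged.
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm > 0 else list(vector)

def decide_class(outputs, threshold=None):
    # Winner-take-all: index of the largest output activation.
    best = max(range(len(outputs)), key=lambda i: outputs[i])
    if threshold is not None and outputs[best] < threshold:
        return None  # no class is confident enough
    return best

x = l2_normalize([3, 0, 4])
winner = decide_class([0.2, 0.7, 0.1])
rejected = decide_class([0.4, 0.35, 0.25], threshold=0.5)
```

Length normalization keeps long abstracts from dominating simply because they contain more words.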

ANN Sources

  • Source code
    ◦ Free software: Weka
    ◦ NN libraries (C, C++, Java, …)
    ◦ MATLAB toolbox
  • Web sites
    ◦ http://www.cs.waikato.ac.nz/~ml/weka/
    ◦ http://www.faqs.org/faqs/ai-faq/neural-nets/part5/

Submission

  • Due date: October 12 (Thu)
  • Submit both a hardcopy and an email, covering:
    ◦ Software used and running environments
    ◦ Experimental results with various parameter settings
    ◦ Analysis and explanation of the results in your own way
    ◦ FYI, achieving the best performance is not important.