- Slides: 32

Thought Ø Yelp data

On Multi-Tier Sentiment Analysis using Supervised Machine Learning Yan Zhu

Agenda Ø Overview Ø Objective Ø Multi-tier Classification Architecture Ø Experiments And Results Ø Conclusion

Overview Ø Text mining is an important area of data mining. Ø Large sets of text data are transformed into numerical values and linked with a knowledge database in order to summarize words and determine similarities. Ø In this way, text mining is commonly used to help represent hidden patterns and make relevant information available to various research interests.

Overview Ø Classification of large sets of text data is in high demand. Ø Supervised machine learning takes labeled input data from various sources and trains a machine learning model on it. Ø Based on the built model, labels are predicted for new incoming text data.

Overview Ø Sentiment analysis is an important branch of text analysis. It is the field of study that analyzes sentiments and emotions. Ø The analysis finds the mood of people, and uses natural language processing and data mining techniques to classify the sentiment. Ø There are different levels of sentiment analysis, including document, sentence, and aspect levels.

Objective Ø This paper proposes a multi-tier classification architecture for sentiment analysis. Ø The proposed architecture is implemented using four classifiers: Naïve Bayes, SVM (Support Vector Machine), Random Forest, and SGD (Stochastic Gradient Descent). Ø Each movie review is classified into one of five levels: very negative, negative, neutral, positive, and very positive.

Multi-tier Classification Architecture Ø The classification system consists of seven modules (stages): Ø Data Collection & Cleaning Ø Data pre-processing Ø Training data Ø Feature Selection Ø Training the classifier (with prediction model, classifier) Ø Test set Ø Evaluation measures

Multi-tier Classification Architecture

Multi-tier Classification Architecture Ø Data Collection & Cleaning Ø The data must be cleaned so that useless or meaningless data is not processed in later stages.

Multi-tier Classification Architecture Ø Data Pre-Processing Ø The data prepared in the previous stage has to be organized and partitioned into datasets, which can then be used for training the model and testing it. Ø In most machine learning preprocessing methods, the training set and test set are separated using the 80-20 rule, where 80% of the data is used for training and 20% for testing.

Multi-tier Classification Architecture Ø Data Pre-Processing Ø Initially we split the data based on the unique id, but the ratio of class labels was uneven, which causes over-fitting, is not sufficient for training the model, and may result in unreliable accuracy. Ø Instead, the Hive bucket split method is used to extract random samples from each class. Ø The data is loaded into a Hadoop database and Hive queries are used to split the data into equal proportions.
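
The idea of sampling each class separately can be sketched in plain Python. This is only an in-memory analogue of the Hive bucket split described on the slide, not the Hive queries themselves; the function name `stratified_split`, the toy reviews, and the fixed seed are assumptions made for illustration.

```python
import random
from collections import defaultdict

def stratified_split(records, train_fraction=0.8, seed=0):
    """Split (text, label) records so every label keeps the same train/test ratio."""
    by_label = defaultdict(list)
    for text, label in records:
        by_label[label].append((text, label))

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)                       # random sample within the class
        cut = int(len(group) * train_fraction)   # 80% of this class goes to training
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test

# Toy data (hypothetical): 10 reviews for each of the five labels 0-4.
reviews = [(f"review {i} for label {label}", label)
           for label in range(5) for i in range(10)]
train_set, test_set = stratified_split(reviews)
print(len(train_set), len(test_set))
```

Because the cut is applied per class, an uneven label ratio in the raw data cannot leave one class missing from the test set, which is the problem the id-based split ran into.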

Multi-tier Classification Architecture Ø Feature Selection Ø Feature selection is the process of selecting the features that are most relevant to the classification task. Ø When relevant features are successfully selected, accuracy is improved. Ø Training time is also reduced after removing all the unwanted features. Ø Over-fitting is also decreased when noise and redundant data are removed.

Multi-tier Classification Architecture Ø Feature Selection Ø Tokenization: break the sentences into meaningful words and phrases. Ø Stop word removal: remove stop words to reduce noise in the data. Ø Stemming: remove the suffix from a word. Ø N-gram features: n consecutive words are considered together as one feature. Ø Part-of-speech (POS) tagging: classify words into their parts of speech and label them according to the tagset.
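
A minimal sketch of the first four steps, using only the Python standard library. The stop list and the suffix stripper are deliberately tiny stand-ins (assumptions, not the tools used in the experiments); a real pipeline would use a full stop list and a proper stemmer, and POS tagging is left out because it needs a trained tagger.

```python
import re

# Tiny illustrative stop list (assumption); real lists are much longer.
STOP_WORDS = {"the", "a", "an", "is", "was", "and", "of", "to", "it", "this"}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop list."""
    return [t for t in tokens if t not in STOP_WORDS]

def crude_stem(token):
    """Strip one common suffix. A toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "ly", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def ngrams(tokens, n):
    """Join every run of n consecutive tokens into one feature string."""
    return [" ".join(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]

def extract_features(text, max_n=2):
    """Tokenize, remove stop words, stem, then emit 1-grams up to max_n-grams."""
    tokens = [crude_stem(t) for t in remove_stop_words(tokenize(text))]
    features = []
    for n in range(1, max_n + 1):
        features.extend(ngrams(tokens, n))
    return features

features = extract_features("The acting was amazing and the plot moved quickly")
print(features)
```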

Multi-tier Classification Architecture Ø Prediction Model Ø The prediction model is a vital part of the architecture for sentiment classification. Ø It takes the labeled data and trains the classifier. Based on the trained model, each review instance in the test set is predicted and labels are applied to the unclassified test data.
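
To make the train-then-predict cycle concrete, here is a small multinomial Naive Bayes written from scratch. It is a sketch under simplifying assumptions (whitespace tokens, add-one smoothing, a four-review toy training set) and is not the Mahout implementation used in the experiments; the class name `NaiveBayes` and the sample reviews are invented for illustration.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes over whitespace tokens with add-one smoothing."""

    def fit(self, texts, labels):
        # Count documents per label and word occurrences per label.
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocabulary = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.label_counts.items():
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            score = math.log(doc_count / total_docs)  # log prior
            for word in text.lower().split():
                if word in self.vocabulary:  # ignore words never seen in training
                    score += math.log(
                        (counts[word] + 1) / (total_words + len(self.vocabulary))
                    )
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical labeled training data: 3 = positive, 1 = negative.
train_texts = ["great food great service", "loved the food",
               "terrible food", "rude and terrible service"]
train_labels = [3, 3, 1, 1]
model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("great service"), model.predict("terrible service"))
```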

Multi-tier Classification Architecture Ø Prediction Model Ø In the single-tier prediction model, all the labeled data is treated as a single tier and the classifier is trained on the entire dataset. Ø Reviews are classified into five sentiment levels, 0-4, as listed below. 0: Very Negative, 1: Negative, 2: Neutral, 3: Positive, and 4: Very Positive.

Multi-tier Classification Architecture

Multi-tier Classification Architecture Ø Prediction Model Ø The proposed architecture consists of three models. Ø Model-1: Trains the classifier on the whole review dataset, with the data relabeled into 3 classes. Ø Negative {1} and Very Negative {0} are labeled as class ‘0’. Ø Neutral {2} is labeled as class ‘2’. Ø Positive {3} and Very Positive {4} are labeled as class ‘4’.
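
The Model-1 relabeling is a fixed mapping from the five original levels to three coarse classes, which can be written down directly. The names `TIER1_LABEL` and `to_tier1` and the sample reviews are illustrative assumptions.

```python
# Five original levels collapsed to the three Model-1 classes from the slide:
# {0, 1} -> 0, {2} -> 2, {3, 4} -> 4.
TIER1_LABEL = {0: 0, 1: 0, 2: 2, 3: 4, 4: 4}

def to_tier1(records):
    """Relabel (text, label) pairs with the coarse Model-1 class."""
    return [(text, TIER1_LABEL[label]) for text, label in records]

# Hypothetical sample, one review per original level.
sample = [("awful", 0), ("bad", 1), ("okay", 2), ("good", 3), ("superb", 4)]
tier1_sample = to_tier1(sample)
print(tier1_sample)
```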

Multi-tier Classification Architecture Ø Prediction Model Ø Model-2: Trains the classifier on a training set consisting of Negative and Very Negative labels. Ø The test instances provided to this model are only the reviews classified as negative by Model-1. Ø Custom dictionaries that distinguish negative from very negative are used.

Multi-tier Classification Architecture Ø Prediction Model Ø Model-3: Trains the classifier on a training set consisting of Positive and Very Positive labels. Ø The test instances provided to this model are only the reviews classified as positive by Model-1.

Multi-tier Classification Architecture Ø This two-tier approach helps improve classifier accuracy because the complexity of each classification task is reduced. Ø The probability of finding the correct labels increases because each model is now trained on a smaller, more homogeneous, and more focused dataset.
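
The routing between the three models can be sketched as a single function. The classifiers are passed in as plain callables; the keyword-based `toy_model1`, `toy_model2`, and `toy_model3` below are hypothetical stand-ins for trained classifiers and exist only so the example runs.

```python
def predict_two_tier(text, model1, model2, model3):
    """Tier 1 picks a coarse class (0, 2, 4); tier 2 refines 0 and 4."""
    coarse = model1(text)
    if coarse == 0:
        return model2(text)  # Model-2 decides: 0 (very negative) or 1 (negative)
    if coarse == 4:
        return model3(text)  # Model-3 decides: 3 (positive) or 4 (very positive)
    return 2                 # neutral reviews are not refined further

# Hypothetical keyword rules standing in for trained classifiers.
def toy_model1(text):
    if "bad" in text or "awful" in text:
        return 0
    if "good" in text or "superb" in text:
        return 4
    return 2

def toy_model2(text):
    return 0 if "awful" in text else 1

def toy_model3(text):
    return 4 if "superb" in text else 3

for review in ["awful service", "bad service", "it was fine",
               "good food", "superb food"]:
    print(review, "->", predict_two_tier(review, toy_model1, toy_model2, toy_model3))
```

One consequence of this design is that a tier-1 mistake cannot be corrected later: a positive review routed to Model-2 can only come out as negative or very negative.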

Experiments And Results Ø The Experiment Data and Features Ø Total size: 150,000 Ø Training set: 120,000 Ø Test set: 30,000 Ø 13,000 unique words considered as features

Experiments And Results Ø Results of Single-Tier Approach

Experiments And Results Ø Multi Tier Approach Ø Naive Bayes using Mahout

Experiments And Results Ø Multi Tier Approach Ø SVM classifier using svmlight Ø The c and ε parameters are used to tune the model, and results improve with the new architecture compared to the single-tier approach. Ø This classifier reached an accuracy of 81.27% using the multi-tier architecture, an improvement of over 7%.

Experiments And Results Ø Multi Tier Approach Ø Random Forest using Scikit Learn

Experiments And Results Ø Multi Tier Approach Ø SGD classifier using Scikit Learn

Experiments And Results Ø Multi Tier Approach Ø More experiments by adding custom dictionaries Ø Collect the most frequently occurring words from each label (negative, very negative) to accurately distinguish negative from very negative. Ø The same is done for Model-3.
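
One way such dictionaries could be built is to count words per label and keep the frequent words that are not shared between the two labels. Dropping the shared words is an assumption of this sketch, as are the function names and the toy reviews; the slides only state that the most frequent words per label are collected.

```python
from collections import Counter

def top_words(texts, k):
    """Return the k most frequent whitespace tokens across the texts."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return [word for word, _ in counts.most_common(k)]

def custom_dictionaries(texts_a, texts_b, k=3):
    """Keep frequent words of each label that are not frequent in the other."""
    top_a, top_b = top_words(texts_a, k), top_words(texts_b, k)
    only_a = [w for w in top_a if w not in top_b]
    only_b = [w for w in top_b if w not in top_a]
    return only_a, only_b

# Hypothetical reviews for the two labels handled by Model-2.
negative = ["food was bland", "service was slow", "bland and slow"]
very_negative = ["food was disgusting", "service was horrible",
                 "horrible and disgusting"]

neg_dict, very_neg_dict = custom_dictionaries(negative, very_negative, k=3)
print(neg_dict, very_neg_dict)
```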

Experiments And Results Ø Multi Tier Approach vs Single Tier Approach Ø The multi-tier architecture is able to significantly improve the accuracy. Ø Among these four methods, the SGD classifier with Scikit Learn, using custom and refined dictionaries, provided the best results in the multi-tier architecture.

Conclusion Ø A multi-tier classification architecture for sentiment analysis has been proposed. It includes a multi-tier prediction model, which applies various supervised machine learning methods to predict sentiment levels. Ø We demonstrate ways to fine-tune parameters, as well as techniques to reduce features for further improvement. Ø The architecture can increase the accuracy level in other multi-class text classification problems.

Further Thought Ø Features Ø Word as Feature: Markov Blanket, Bag of Words, N-grams Ø Score Representation

Further Thought Ø Should we also add models for Neutral, Positive, and Negative reviews? Ø How about adding another layer to combine the results from the different classifiers (NB, SVM, RF, SGD), using a majority vote or another model?
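
The majority-vote idea can be sketched in a few lines. This is only an illustration of the proposed combining layer, not something evaluated in the experiments; the per-classifier predictions below are made up.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by the most classifiers.

    Ties go to the label that appears first in the list.
    """
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical predictions for one review from NB, SVM, RF, and SGD.
votes = {"NB": 4, "SVM": 4, "RF": 3, "SGD": 4}
combined = majority_vote(list(votes.values()))
print(combined)
```

With four classifiers, two-against-two ties are possible, so a weighted vote or a second-stage model (the "another model" option on the slide) may be worth comparing.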