
Domain-based Lexicon Enhancement for Sentiment Analysis
A. Muhammad, N. Wiratunga, R. Lothian, R. Glassey
IDEAS Research Institute, Robert Gordon University, Aberdeen

Introduction
• Sentiment Classification of text
• Sentiment Analysis
  – A wider task that also involves identification of:
    • Object/Aspects
    • Opinion holder
    • Time
BCS-SGAI-SMA-2013, Cambridge UK

Sentiment Classification
• Machine Learning
  – Labelled training texts: "The movie is good", "The movie is horrible", "I don’t like the movie", "I love the movie", …
  – A classifier (e.g. NB, SVMs) is trained on these texts to produce a model
  – The model labels unseen text: "The movie is nice" : ?
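
The train-then-predict flow on this slide can be sketched with a small Naive Bayes classifier. This is only an illustrative sketch: the +/- labels attached to the example sentences are assumed (the slide's label icons did not survive), and `train_nb` and `predict_nb` are hypothetical helper names, not the authors' code.

```python
import math
from collections import Counter

# Toy training data; the +/- labels are assumed for illustration.
TRAIN = [
    ("the movie is good", "+"),
    ("the movie is horrible", "-"),
    ("i don't like the movie", "-"),
    ("i love the movie", "+"),
]

def train_nb(examples):
    """Collect per-class word counts, per-class document counts and the vocabulary."""
    word_counts = {"+": Counter(), "-": Counter()}
    doc_counts = Counter()
    for text, label in examples:
        doc_counts[label] += 1
        word_counts[label].update(text.split())
    vocab = set(word_counts["+"]) | set(word_counts["-"])
    return word_counts, doc_counts, vocab

def predict_nb(model, text):
    """Return the class with the highest log-posterior, using add-one smoothing."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(word_counts):
        score = math.log(doc_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.split():
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train_nb(TRAIN)
print(predict_nb(model, "the movie is good"))
```

Note that "nice" never occurs in this toy training set, so a query such as "the movie is nice" would be decided by the remaining words alone. A classifier of this kind can only judge terms it has seen in training.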

Sentiment Classification Cont’d
• Lexicon-Based
  – Contextual analysis/Aggregation
  – "The movie is nice" : ?
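
A lexicon-based classifier of the kind named on this slide might look roughly like the sketch below. The lexicon entries, the negator list and the single negation rule are all assumptions made for illustration. The slide does not say which contextual rules or aggregation function the authors use.

```python
# Hypothetical lexicon: term -> (positive score, negative score).
LEXICON = {
    "nice": (0.8, 0.1),
    "good": (0.7, 0.1),
    "horrible": (0.05, 0.9),
    "like": (0.6, 0.1),
}
# Assumed negators; a stand-in for the slide's "contextual analysis".
NEGATORS = {"not", "don't", "never"}

def classify(text, lexicon=LEXICON):
    """Sum term scores, swapping polarity for the first lexicon term after a negator."""
    pos_total = neg_total = 0.0
    negate = False
    for token in text.lower().split():
        if token in NEGATORS:
            negate = True
            continue
        if token in lexicon:
            pos, neg = lexicon[token]
            if negate:
                pos, neg = neg, pos
                negate = False
            pos_total += pos
            neg_total += neg
    if pos_total == neg_total:
        return "neutral"
    return "+" if pos_total > neg_total else "-"

print(classify("The movie is nice"))
```

Unlike the machine learning route, no training step is needed here; coverage depends entirely on which terms the lexicon contains.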

Lexicon Generation
• Manual
• Corpus
  – Could be too narrow
• Dictionary
  – Could be too general
• Example texts:
  – This movie is fantastic
  – Ugh!! this movie sucks!

Sentiment Lexicons
• Dictionary-based: SentiWordNet (Baccianella et al., 2010)
• Corpus-based
  – Generated from target domain
  – Existing approaches rely on well-formed spelling/grammar
[Figure: corpus-based lexicon expansion from seed words using "and"/"but" conjunctions and co-occurrence; words shown: excellent, good, bad, poor, terrible, nice, horrible, happy, affordable, rubbish, enjoyable, mad, … (Hatzivassiloglou and McKeown, 1997; Turney, 2002)]

Corpus-based lexicon
• Distant-Supervision (Read, 2005; Go et al., 2009)
  – Automated approach for labelling
  – Based on appearance of emoticons
  – Example tweets: "I’m happy with chocolate on vday"; "I’m at work today"
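
Emoticon-based labelling can be sketched as follows. The emoticon sets, the rule for discarding ambiguous tweets and the emoticons appended to the example tweets are assumptions, since the slide's emoticon images were lost. `distant_label` is a hypothetical helper name.

```python
# Assumed emoticon sets; the slide's actual emoticons did not survive extraction.
POSITIVE_EMOTICONS = {":)", ":-)", ":D"}
NEGATIVE_EMOTICONS = {":(", ":-("}
ALL_EMOTICONS = POSITIVE_EMOTICONS | NEGATIVE_EMOTICONS

def distant_label(tweet):
    """Return (text without emoticons, label), or None if the tweet is unusable.

    A tweet is unusable when it has no emoticon or has both polarities.
    """
    tokens = tweet.split()
    has_pos = any(t in POSITIVE_EMOTICONS for t in tokens)
    has_neg = any(t in NEGATIVE_EMOTICONS for t in tokens)
    if has_pos == has_neg:
        return None
    label = "+" if has_pos else "-"
    # Strip the emoticons so they act only as labels, not as features.
    text = " ".join(t for t in tokens if t not in ALL_EMOTICONS)
    return text, label

print(distant_label("I'm happy with chocolate on vday :)"))
print(distant_label("I'm at work today :("))
```

Removing the emoticon from the text after labelling is a common design choice in distant supervision, so that later models learn from the words rather than from the label signal itself.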

Scores Generation
• Proportion-based

  Term    + score   - score
  ugh     0.077     0.923
  sucks   0.132     0.868
  luv     0.958     0.042
  xoxo    0.792     0.208
  …       …         …

  – Scores are compatible with SentiWordNet
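
One way to read "proportion-based" is that a term's positive score is the share of its occurrences that fall in positively labelled tweets, and likewise for the negative score. The sketch below follows that reading, which is an assumption: the slide does not give the exact formula, and the toy tweets are made up.

```python
from collections import Counter

def proportion_scores(labelled_tweets):
    """Map each term to (positive proportion, negative proportion) of its occurrences."""
    pos_counts, neg_counts = Counter(), Counter()
    for text, label in labelled_tweets:
        target = pos_counts if label == "+" else neg_counts
        target.update(text.lower().split())
    scores = {}
    for term in set(pos_counts) | set(neg_counts):
        total = pos_counts[term] + neg_counts[term]
        scores[term] = (pos_counts[term] / total, neg_counts[term] / total)
    return scores

# Made-up distant-supervision output, for illustration only.
toy_tweets = [
    ("luv u xoxo", "+"),
    ("luv this", "+"),
    ("luv it", "+"),
    ("ugh this sucks", "-"),
]
toy_scores = proportion_scores(toy_tweets)
print(toy_scores["luv"], toy_scores["this"], toy_scores["ugh"])
```

Under this reading the two scores of a term always sum to 1, which matches every row of the table above.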

Integration with SentiWordNet
• General scores gs(t) are extracted from SentiWordNet
• Domain scores ds(t) are taken from the domain lexicon
[Figure: gs(t) and ds(t) for a term t combined by weighted average (Weighted Avr) and passed to Aggregation]
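
The weighted-average combination of gs(t) and ds(t) could be sketched as below. The weight `alpha`, its default of 0.5 and the fallback to whichever lexicon contains the term are assumptions, because the slide names the operation but not its parameters. `combined_score` is a hypothetical helper name.

```python
def combined_score(term, general, domain, alpha=0.5):
    """Combine general and domain (pos, neg) scores for a term.

    alpha is an assumed weight on the domain score ds(t); 1 - alpha
    weights the general score gs(t). If only one lexicon has the term,
    that lexicon's score is returned unchanged (also an assumption).
    """
    gs = general.get(term)
    ds = domain.get(term)
    if gs is None and ds is None:
        return None
    if gs is None:
        return ds
    if ds is None:
        return gs
    return tuple(alpha * d + (1 - alpha) * g for g, d in zip(gs, ds))

# Made-up lexicon entries, apart from "luv", which reuses the scores table values.
general_lexicon = {"good": (0.6, 0.0)}
domain_lexicon = {"good": (0.8, 0.2), "luv": (0.958, 0.042)}
print(combined_score("good", general_lexicon, domain_lexicon))
print(combined_score("luv", general_lexicon, domain_lexicon))
```

Falling back to the domain score lets informal terms such as "luv", which a general-purpose dictionary is unlikely to list, still contribute to classification.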

Evaluation
• 20,000 Dist-Sup tweets used to:
  – Generate domain lexicon
  – Train Machine Learning classifiers
    • For comparison
• 359 hand-labelled tweets used for evaluation

Evaluation Cont’d
• Individual lexicons vs Combined
  – General < Domain < Combined
    • Difference not significant between Domain and Combined
• Machine learning vs Combined
  – SVM < NB < LogReg < Combined
    • Difference not significant between LogReg and Combined

Evaluation Cont’d
• Varying data sizes
  – Performance improves with increasing size for all except SVM
[Figure: performance (axis 0 to 80) of SVM, NB, LogReg and LET at data sizes of 4,000, 8,000, 12,000, 16,000 and 20,000]

Conclusions
• Sentiment lexicon is generated using distant supervision
• Sentiment classification improves with combination of domain-dependent and domain-independent lexicons
• Accuracy of the combination is better than machine learning, though the difference from LogReg is not significant

Future work
• Lexicon refinement
• Improve aggregation strategy
• Extend approach to other social media platforms
• Extend Dist-Sup to neutral labelling
• Experiment with ‘big data’

Thank you for listening!