EMOJI IDENTIFICATION AND PREDICTION IN HEBREW POLITICAL CORPUS

Chaya Liebeskind
02/07/2019
InSITE 2019, an Informing Science Institute Conference, June 30 - July 5, 2019, Jerusalem

Emoji
“a digital image that is added to a message in electronic communication in order to express a particular idea or feeling” (Cambridge Dictionary)

ANY SYSTEM THAT AIMS TO ADDRESS THE TASK OF MODELING SOCIAL MEDIA COMMUNICATION NEEDS TO DEAL WITH THE USAGE OF EMOJIS

Today’s Tasks
1. Emoji Identification
2. Emoji Prediction

Emoji Identification
• A binary classification task: determining whether a given text message includes emojis, relying on either textual content or metadata analysis
• Examples:
  • כל הכבוד , שאפו (Šapw, Kl hkbwd): Well done! Hats off to
  • מידי מאוחר (Mawxr midi): Too late

Emoji Prediction
Predicting the emojis that appear in a given text message by relying exclusively on the textual content of that message
● Can be viewed as a multi-label classification problem

#  Comment (transliteration): translation
1  חזק וברוך (xzq wbrwk): be strong and blessed
2  אל תפסיק לחלום (al tpsiq lxlwm): do not stop dreaming
3  לחיים (lxiim): cheers!
4  כאלה מקסימים יחד (kalh mqsimim ixd): these are lovely together
5  תבורכו (tbwrkw): you will be blessed
[Label column: emoji images not recovered]

Challenge
Applying a Machine Learning (ML) approach to emoji identification and prediction in Hebrew
● Machine Learning (ML) approach
  ○ Composed of two general steps:
    ■ Learn the model from a training corpus
    ■ Classify a test corpus based on the trained model
● Hebrew is characterized by highly productive morphology
● Emoji identification and prediction in Hebrew have hardly been investigated before
  ○ Liebeskind, C., & Liebeskind, S. (2019, May). Emoji Prediction for Hebrew Political Domain. In Companion Proceedings of The 2019 World Wide Web Conference (pp. 468-477). ACM.

Step #1: Creating a political dataset
● All posts of Members of Knesset (MKs) between 2014 and 2016 (n=130 MKs, m=33,537 posts)
  ○ Downloaded via the Facebook Graph API
● Comments on these posts (n=5.37M comments posted by 702,396 commentators)
● 98,865 of the comments include at least one of the 786 types of emojis
● There are 50,243 comments with a single emoji
● Emojis are used by 41,789 of the commentators
● The dataset was previously used by Liebeskind et al., 2017 and Liebeskind & Nahon, 2018

Step #1: Emoji Identification
• Positive examples: 80,276 comments which include at least one emoji and fewer than 6 emojis
• Negative examples: an equal number of comments without emojis, randomly selected
• For each class, we randomly selected 90% of the data (160,552 comments) for training, and the remaining 10% (16,057) for the test set
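
The balanced sampling and per-class 90/10 split described on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `balanced_split`, the seed, and the toy comments are assumptions.

```python
import random

def balanced_split(positives, negatives, train_fraction=0.9, seed=0):
    """Downsample negatives to the number of positives, then split each class."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    # Randomly select as many emoji-free comments as there are emoji comments.
    negatives = rng.sample(negatives, len(positives))
    train, test = [], []
    for label, examples in ((1, positives), (0, negatives)):
        shuffled = examples[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_fraction)
        train += [(text, label) for text in shuffled[:cut]]
        test += [(text, label) for text in shuffled[cut:]]
    return train, test

# Toy stand-ins for comments with emojis (positives) and without (negatives).
positives = [f"pos{i}" for i in range(10)]
negatives = [f"neg{i}" for i in range(50)]
train, test = balanced_split(positives, negatives)
```

Splitting each class separately keeps both the training and the test set balanced, so a trivial majority-class classifier scores 50% accuracy.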

Step #1: Emoji Prediction
• Only Facebook comments with a single emoji were included in the task dataset
  – A single-label classification problem
  – Including comments with multiple appearances of the same emoji
    • such as כל הכבוד לגיבור שלנו (kl hkbwd lgibwr šlnw: well done our hero)
• 78,147 comments that include 593 emoji types
• Classification was performed on the comments that include one of the twenty emojis that occur most frequently in our dataset
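
A minimal sketch of this filtering step, assuming the emojis of each comment have already been extracted into a list (the slides do not say how emojis were detected). The function name `single_emoji_dataset` and the toy data are hypothetical.

```python
from collections import Counter

def single_emoji_dataset(comments, top_k=20):
    """comments: (text, emojis) pairs, where emojis lists every emoji found in text."""
    # Keep comments with exactly one emoji type; repeats of that emoji are allowed.
    single = [(text, emojis[0]) for text, emojis in comments
              if len(set(emojis)) == 1]
    counts = Counter(label for _, label in single)
    frequent = {emoji for emoji, _ in counts.most_common(top_k)}
    # Restrict the task to the top_k most frequent emoji labels.
    return [(text, label) for text, label in single if label in frequent]

# Toy data; top_k=2 stands in for the twenty labels used in the talk.
comments = [
    ("a", ["👍"]), ("b", ["👍", "👍"]), ("c", ["🙏"]), ("d", ["🙏", "👍"]),
    ("e", ["😂"]), ("f", ["🙏"]), ("g", ["👍"]), ("h", []),
]
dataset = single_emoji_dataset(comments, top_k=2)
```

Comment "d" is dropped because it mixes two emoji types, "h" because it has none, and "e" because its emoji is not among the most frequent labels.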

Step #1: Emoji Prediction (cont.)

Step #2: Supervised classification
● Supervised machine learning approach
● Learn the model from a training corpus
● Decide what features of the text are relevant
● Decide how to encode these features
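
The learn-then-classify workflow can be illustrated with a small, dependency-free sketch. It assumes logistic regression (the classifier reported in the results) over character 3-gram counts, trained by plain stochastic gradient descent. The hyperparameters, the function names, and the toy labels attached to the transliterated comments are all invented for illustration and say nothing about the real data.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """All contiguous substrings of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def train(examples, epochs=200, lr=0.5):
    """Fit binary logistic regression weights with stochastic gradient descent."""
    weights, bias = Counter(), 0.0
    for _ in range(epochs):
        for text, label in examples:
            feats = Counter(char_ngrams(text))
            z = bias + sum(weights[f] * v for f, v in feats.items())
            error = label - 1.0 / (1.0 + math.exp(-z))  # gold minus predicted probability
            bias += lr * error
            for f, v in feats.items():
                weights[f] += lr * error * v
    return weights, bias

def classify(model, text):
    """Return 1 (comment predicted to contain an emoji) or 0."""
    weights, bias = model
    feats = Counter(char_ngrams(text))
    z = bias + sum(weights[f] * v for f, v in feats.items())
    return 1 if z > 0 else 0

# Toy training corpus: labels are made up (1 = has emoji, 0 = no emoji).
training = [("kl hkbwd", 1), ("xzq wbrwk", 1), ("mawxr midi", 0), ("al tpsiq", 0)]
model = train(training)
prediction = classify(model, "kl hkbwd lgibwr")
```

An unseen comment is scored only through the 3-grams it shares with the training corpus; unknown 3-grams contribute nothing.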

Text representations
● N-gram representation
  ○ A contiguous sequence of n words
  ○ A high-dimensional sparse representation
  ○ Critical for short texts where most words have only one occurrence
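
As a small illustration of the word n-gram representation, the sketch below splits on whitespace (an assumption; the slides do not describe tokenization) and uses a transliterated comment from the earlier examples.

```python
def word_ngrams(text, n):
    """Contiguous sequences of n whitespace-separated words."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

unigrams = word_ngrams("al tpsiq lxlwm", 1)
bigrams = word_ngrams("al tpsiq lxlwm", 2)
```

A three-word comment yields only two bigrams and one trigram, which is why higher-order word n-grams become very sparse on short texts.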

Text representations (Cont.)
● Character n-grams representation
  ○ Strings of length n
  ○ Far fewer character combinations than word n-gram combinations
  ○ Noise and misspellings have a smaller impact on substring patterns than on word n-gram patterns
    ■ Character n-grams can be quite effective for short informal text classification
  ○ Still produces a considerably large feature set
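
The robustness claim can be made concrete with a short sketch. The misspelling `lxim` of the transliterated word `lxiim` is a made-up example.

```python
def char_ngrams(text, n):
    """All contiguous substrings of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

correct = char_ngrams("lxiim", 3)
misspelled = char_ngrams("lxim", 3)
shared = set(correct) & set(misspelled)
```

As whole words, `lxiim` and `lxim` are two unrelated unigram features, whereas their character 3-gram sets still overlap, so part of the signal survives the typo.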

Metadata
• Metadata on the post
  – Number of characters, number of words, normalized number of punctuation marks, normalized number of emojis
  – Facebook-dependent features:
    • Number of “reactions” that the post got
    • Number of “shares”
    • Number of comments that the post got
    • Post type: photo, link, status, video, and event
  – Features on the writer of the post: MK identifier and MK gender

Metadata (Cont.)
• Metadata on the comment
  – Number of characters, number of words, temporal information, i.e., hour, day, and month of the comment publication
  – Facebook-dependent features:
    • Number of “likes” that the comment got
    • Number of comments on the comment
    • Boolean feature, which indicates whether the commentator also "liked" the status
  – Additional features: the number of occurrences of the MK writer of the post and the number of occurrences of other MKs, either allies or rivals of the post writer
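
A sketch of how the comment-level features could be encoded as a flat feature dictionary. The field names and the function `comment_metadata` are hypothetical; the day is taken as day of the week (Monday = 0), and the MK-mention counts are omitted because they need a list of MK names.

```python
from datetime import datetime

def comment_metadata(text, published, likes, replies, liked_post):
    """Encode comment metadata as numeric features (names are illustrative)."""
    return {
        "n_chars": len(text),
        "n_words": len(text.split()),
        "hour": published.hour,
        "day": published.weekday(),   # 0 = Monday ... 6 = Sunday
        "month": published.month,
        "n_likes": likes,
        "n_replies": replies,
        "liked_post": int(liked_post),  # Boolean encoded as 0/1
    }

features = comment_metadata("kl hkbwd", datetime(2015, 3, 17, 21, 5),
                            likes=3, replies=0, liked_post=True)
```

Such a dictionary can be concatenated with the n-gram counts to form a combined representation.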

Metadata utilization
1. Emoji Identification
  • Metadata on the post and the comment
  • Some of the metadata might trigger the commentator to include an emoji in the comment
    • e.g., a post with an emoji or a popular post with many “likes”
2. Emoji Prediction
  • Metadata on the comment

Results: Emoji Identification
Logistic Regression results for the n-grams and character n-grams representations:

Representation  Precision  Recall  F1      Accuracy
Char 3-grams    0.6896     0.6873  0.6874  0.6894
Char 4-grams    0.6867     0.6843  0.6844  0.6865
Char 5-grams    0.6849     0.6825  0.6826  0.6847
Unigrams        0.6843     0.6804  0.6802  0.6832
Bigrams         0.6522     0.6391  0.6346  0.6450
Trigrams        0.6381     0.5736  0.5252  0.5876

Char 2-grams: only three values survive (0.6623, 0.6614, 0.6629); their column assignment is unclear.

Results: Emoji Identification (Cont.)

Representation                            Precision  Recall  F1      Accuracy
Char 3-grams                              0.6896     0.6873  0.6874  0.6894
Metadata feature set                      0.25       0.5     0.3333  0.5
Char 3-grams + feature selection          0.6851     0.6823  0.6823  0.6846
Combined feature set                      0.7001     0.6984  0.6987  0.7001
Combined feature set + feature selection  0.6959     0.6941  0.6943  0.6959

Feature selection method: chi-square
• The combined representation significantly increases the accuracy of the character 3-grams representation
• The feature selection method does not improve the character 3-grams representation
• When applied to the combined feature set, it decreases the combined representation performance
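
The scores of the metadata-only feature set (0.25, 0.5, 0.3333, 0.5) are what macro-averaged metrics give when a classifier assigns every comment to the same class on a balanced test set. The slides do not state the averaging method, so macro-averaging is an assumption here; the sketch only checks that the arithmetic is consistent.

```python
def macro_scores(gold, predicted):
    """Macro-averaged precision, recall, F1, plus accuracy."""
    labels = sorted(set(gold))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(gold, predicted))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    accuracy = sum(g == p for g, p in zip(gold, predicted)) / len(gold)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n, accuracy

# Balanced toy test set; the classifier always answers "has emoji".
gold = [1] * 50 + [0] * 50
predicted = [1] * 100
p, r, f1, acc = macro_scores(gold, predicted)
```

Under this reading, the metadata features alone carry no usable signal for the classifier, yet they still help when combined with character 3-grams.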

Analysis: Emoji Identification
• 50% of the features from the metadata feature set were selected.
• Out of the 10 selected features:
  • 5 from information on the post
    • type, normalized number of punctuation marks, number of characters, MK identifier
  • 5 from information on the comment
    • publication day of the week, number of comments, number of “likes”, number of characters, number of occurrences of the MK writer of the post
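
Chi-square feature selection, the method named on the results slide, ranks features by how strongly they depend on the class and keeps the top ones. The sketch below uses the textbook statistic for a 2x2 contingency table over binarized toy features and keeps the top 50%. The slides do not say which implementation was used, and the feature names and values are invented.

```python
def chi_square(feature, labels):
    """Chi-square statistic of a binary feature against a binary class."""
    pairs = list(zip(feature, labels))
    a = sum(f == 1 and y == 1 for f, y in pairs)  # feature present, positive class
    b = sum(f == 1 and y == 0 for f, y in pairs)  # feature present, negative class
    c = sum(f == 0 and y == 1 for f, y in pairs)  # feature absent, positive class
    d = sum(f == 0 and y == 0 for f, y in pairs)  # feature absent, negative class
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

def select_top_half(features, labels):
    """Rank features by chi-square score and keep the best 50%."""
    scores = {name: chi_square(values, labels) for name, values in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:len(ranked) // 2]

labels = [1, 1, 1, 1, 0, 0, 0, 0]
features = {
    "post_is_photo": [1, 1, 1, 0, 0, 0, 0, 1],    # correlated with the class
    "comment_is_long": [1, 0, 1, 0, 1, 0, 1, 0],  # independent of the class
}
score = chi_square(features["post_is_photo"], labels)
selected = select_top_half(features, labels)
```

A feature distributed identically across both classes scores zero and is the first to be discarded.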

Results: Emoji Prediction
Logistic Regression results for the text representations:

Representation  Precision  Recall  F1      Accuracy
Char 2-grams    0.3519     0.1647  0.1729  0.3604
Char 3-grams    0.4661     0.1910  0.2121  0.3676
Char 4-grams    0.4802     0.1854  0.2058  0.3665
Char 5-grams    0.4880     0.1805  0.2013  0.3588
Unigrams        0.4401     0.1725  0.1896  0.3525
Bigrams         0.4275     0.1364  0.1518  0.3055
Trigrams        0.3808     0.1027  0.1065  0.2689

The F1 advantage of the character 3-grams representation over the unigram representation is statistically significant
There are three letters in the Hebrew root

Results: Emoji Prediction (Cont.)

Representation                    Precision  Recall  F1      Accuracy
Char 3-grams                      0.4661     0.1910  0.2121  0.3676
Combined                          0.4030     0.1835  0.1909  0.3782
Char 3-grams - Feature Selection  0.4517     0.1957  0.2181  0.3859
FastText                          0.2217     0.1494  0.1483  0.3701

Feature selection method: chi-square
• The combined representation slightly increases the accuracy of the character 3-grams representation but decreases its F1
• The feature selection method removes the metadata features, yet it outperforms the character 3-grams representation
• The same is true for the character 4-grams representation
• The character 3-grams representation outperforms the FastText baseline

Analysis: Emoji Prediction
In many of the cases, both the classifier decision and the emoji that was chosen by the commentator seem to fit the comment content

Conclusions and Future Work
● Explored two tasks: emoji identification, and emoji prediction as a single-label classification problem
● Created a Hebrew dataset for both tasks
● Investigated two text representations and combined metadata features on both the post and the comment
● Showed that while the metadata features improve the emoji identification accuracy, the contribution of the metadata to emoji prediction is minor and it is better to apply feature selection
● Plan to address the multi-label setting of the emoji prediction task
● Plan to investigate deep learning models

QUESTIONS?

APPENDIX

Are emojis used as a substitute for words?
[Figure: top-100 commentators with high usage of emojis (above 50 comments)]
● Most of the commentators use a similar number of words with and without emojis
● More emojis do not mean fewer words

Does the diversity of the emojis depend on the number of comments?
[Figure: top-100 commentators with high usage of emojis (above 50 comments)]
● The diversity of the emojis depends on user preference

Semantic vector representations
● Apply four dimensionality reduction methods for semantic analysis
  ○ Associate similar words with similar vector representations
  ○ Representations are built using entirely unsupervised distributional analysis of large amounts of unlabeled text
1. Word Embedding
2. Latent Semantic Analysis (LSA)
3. Latent Dirichlet Allocation (LDA)
4. Random Projection (RP)

Results: semantic vector representations
In previous works, the Word Embedding approach was exclusively selected as a dimensionality reduction method. However, all the other reduction methods that we have suggested achieve a higher F1 score
● The F1 advantage of the LSA representation over Word Embedding is statistically significant, even though the LSA F1 is lower than the F1 of the RP and LDA representations
  ○ According to the two-sided Wilcoxon signed-rank test at the 0.05 level

Results: text representations
● Unigram and character n-grams representations are better than all the semantic vector representations
● Unigram significantly outperformed all the other word n-gram representations
● The best representations were the character 3-grams and 4-grams
● FastText: Accuracy: 0.3628, Precision: 0.2113, Recall: 0.1504, F1: 0.1489