
Crowdsourcing-Based Text Classification (TC)
בס"ד
Yaakov HaCohen-Kerner
Department of Computer Science
Jerusalem, Israel
kerner@jct.ac.il
Copyright © 2018 All Rights Reserved. Yaakov HaCohen-Kerner
17-Jun-21

Short Bio
• Head of the Research Authority, Jerusalem College of Technology
• Associate Professor at the Computer Science Department
• Author of more than 82 scientific papers and book sections
• Research interests: various data mining tasks, e.g.,
  • {Text, Image, Speech} classification and clustering
  • Natural language processing
  • Word completion and prediction
  • Machine learning
  • Key-phrase extraction
  • Summarization
  • Citation extraction and analysis
  • Plagiarism detection
  • Exposing the secrets between the lines (who wrote, when, lies, …)
• Computer game playing
  • Composition of chess and checkers problems

What is Text Classification?
• Text classification (TC), also called text categorization, is the supervised learning task of assigning natural language text documents to one or more predefined categories (Sebastiani 2002).
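To make the definition concrete, here is a minimal sketch of supervised TC: a multinomial Naive Bayes classifier over lowercase word counts. The choice of Naive Bayes, the whitespace tokenizer, the two categories, and the four training sentences are all assumptions made for illustration; none of them come from the slides.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs):
    """Collect the counts needed by multinomial Naive Bayes.

    docs is a list of (text, label) pairs, i.e. the labeled training set.
    """
    label_counts = Counter()            # number of documents per category
    word_counts = defaultdict(Counter)  # word frequencies per category
    vocab = set()
    for text, label in docs:
        tokens = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab


def classify(model, text):
    """Return the category with the highest log-probability for text."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for tok in text.lower().split():
            if tok not in vocab:
                continue  # words never seen in training are ignored
            # Laplace (add-one) smoothing avoids zero probabilities
            score += math.log((word_counts[label][tok] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training set with two predefined categories.
TRAIN = [
    ("the team won the match", "sports"),
    ("a late goal decided the game", "sports"),
    ("the bank raised interest rates", "economy"),
    ("markets fell as inflation rose", "economy"),
]
model = train_nb(TRAIN)
print(classify(model, "the goal won the game"))  # sports
```

The categories are fixed in advance and the labels come with the training data; that is what makes the task supervised.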

Relevant research domains
• TC is one of the most fundamental problems in the machine learning (ML) and data mining (DM) literature.
• TC is an important component in domains such as:
  • text filtering
  • text indexing
  • information extraction
  • information retrieval
  • text mining
  • word sense disambiguation

Examples of Types of Text Classification
• TC according to categories (usually based on content words and/or n-grams)
  • Domain classification
  • Topic classification
• Stylistic classification (usually based on various linguistic features)
  • Authorship attribution
  • Authors' age, gender, …
  • Literary genres (action, comedy, crime, fantasy, historical, political)
  • Sentiment (positive, negative)
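As an illustration of n-gram features, the sketch below extracts word n-grams and character n-grams from a text. The function names and the whitespace tokenization are assumptions for this example only.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return all runs of n consecutive lowercase words."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Return all runs of n consecutive characters."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_features(text, n):
    """Count word n-grams; the counts can serve as a feature vector."""
    return Counter(word_ngrams(text, n))


print(word_ngrams("Text classification is fun", 2))
print(char_ngrams("crowd", 3))
```

Word n-grams mostly capture content, while character n-grams are often used as stylistic features because they also pick up spelling, punctuation, and morphology.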

Some of my previous studies in TC
• Cuisine: Classification using Stylistic Feature Sets and/or Name-Based Feature Sets / JASIST (Rank A), Vol. 61, 8, pp. 1644-1657, 2010.
• News articles classification using Random Forests and weighted multimodal features / IRFC 2014, LNCS 8849, pp. 63-75, Berlin: Springer-Verlag, 2014.
• Classifying True, False and CV Stories using Word Ngrams / Cybernetics and Systems (Rank B), 47(8): 629-649, 2016.
• Stance Classification of Tweets using Skip Char N-Grams / ECML-PKDD (Rank A), pp. 266-278, Springer LNCS Vol.

What is Crowdsourcing?
• Crowdsourcing is the process of getting work, usually online, from a crowd of people (Daily Crowdsource).
• The idea is to take work and outsource it to a crowd of workers.
• Famous examples: Wikipedia, Waze

Direct Crowdsourcing
• Direct Crowdsourcing: Company -> Community
• In this model, the company reaches out to the community directly, through means such as social media, to collect people's opinions on an idea or to get assistance with an initiative (e.g., developing a new product).

Indirect Crowdsourcing
• Indirect Crowdsourcing: Company hosts; Client (buyer) -> Community (seller)
• In this model, a company hosts, or provides, a platform for crowdsourcing.
• Example: Amazon Mechanical Turk (AMT). It allows clients (buyers) to post their needs (each known as a Human Intelligence Task (HIT)) and provides compensation to a worker (a member of the community) if he/she

Examples of Suitable Domains for Crowdsourcing-Based Text Classification
• “Hard” classification tasks for computers
  • Rating of text snippets is a hard task for computers because it requires understanding a variety of syntactic and semantic idiosyncrasies contained in the text.
  • Image labeling (especially complex tasks)

Examples of Suitable Domains for Crowdsourcing-Based Text Classification
• “Hard” classification tasks for computers
  • Humour recognition or humour classification (Costa et al., 2011). Humour is subjective: it is influenced by the contextual meaning of the joke and can vary with culture, region, race, or sex.
  • Economic news articles were classified using supervised learning and crowdsourcing (Brew et al., 2010).

Previous Crowdsourcing-Based Text Classification
• Costa, J., Silva, C., Antunes, M., & Ribeiro, B. (2011). On using crowdsourcing and active learning to improve classification performance.
• They proposed a crowdsourcing active learning approach and tested it on the Jester data set, a text humour classification benchmark, with promising improvements over baseline results.
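The slide does not say how items were chosen for the crowd, so the following is only a generic sketch of one common active learning strategy, uncertainty sampling: the items whose predicted probability is closest to 0.5 are sent to the crowd for labeling. The function name, the item identifiers, and the probabilities are hypothetical, and this is not presented as the method of Costa et al.

```python
def select_for_crowd(probabilities, budget):
    """Pick the items the current classifier is least sure about.

    probabilities maps an item id to the predicted probability of the
    positive class; budget is how many items the crowd can label.
    """
    # Smaller distance from 0.5 means higher uncertainty.
    ranked = sorted(probabilities, key=lambda item: abs(probabilities[item] - 0.5))
    return ranked[:budget]


# Hypothetical classifier outputs for four unlabeled jokes.
predicted = {"j1": 0.95, "j2": 0.52, "j3": 0.10, "j4": 0.45}
print(select_for_crowd(predicted, budget=2))  # ['j2', 'j4']
```

In a full loop, the crowd's labels for the selected items would be added to the training set and the classifier retrained before the next round.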

Previous Crowdsourcing-Based Text Classification
• Paul, S. A., Hong, L., & Chi, E. H. (2011). What is a question? Crowdsourcing tweet categorization.
• They report their experiences with thousands of people using Mechanical Turk to perform a large-scale text classification task: classifying Twitter updates as questions or not.
• They soon realized that they were also paying spammers, so they modified their study to pay only those Turkers who rated at least 4 out of
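When several workers label the same tweet, their judgments have to be combined into one label. The sketch below uses simple majority voting, which is an assumption for illustration and not necessarily what Paul et al. did; the tweet ids, worker ids, and label names are hypothetical.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label, or None if first place is tied."""
    counts = Counter(labels).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # no clear majority; the item needs more judgments
    return counts[0][0]


def aggregate(judgments):
    """Group (item_id, worker_id, label) triples by item and vote."""
    by_item = {}
    for item_id, _worker_id, label in judgments:
        by_item.setdefault(item_id, []).append(label)
    return {item: majority_vote(labels) for item, labels in by_item.items()}


# Hypothetical judgments from three workers on two tweets.
JUDGMENTS = [
    ("t1", "w1", "question"),
    ("t1", "w2", "question"),
    ("t1", "w3", "not_question"),
    ("t2", "w1", "not_question"),
    ("t2", "w2", "question"),
]
print(aggregate(JUDGMENTS))  # {'t1': 'question', 't2': None}
```

Returning None on ties makes unresolved items explicit, so they can be sent out for additional judgments rather than silently assigned a label.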

Previous Crowdsourcing-Based Text Classification
• Machedon, R., Rand, W., & Joshi, Y. (2013). Automatic Crowdsourcing-Based Classification of Marketing Messaging on Twitter.
• They demonstrate that a reasonably effective classifier can be created to identify the nature of tweets based on crowdsourced training data:
  • Informative
  • Persuasive

Previous Crowdsourcing-Based Text Classification
• Sun, C., Rampalli, N., Yang, F., & Doan, A. (2014). Chimera: Large-scale classification using machine learning, rules, and crowdsourcing.
• They describe Chimera, a system capable of classifying tens of millions of products into 5000+ product types.
• They argue that at large scales crowdsourcing is critical, but must be used in combination with learning, rules, and in-house analysts.

Previous Crowdsourcing-Based Text Classification
• Zubiaga, A., Liakata, M., Procter, R. N., Bontcheva, K., & Tolmie, P. (2015). Crowdsourcing the annotation of rumourous conversations in social media.
• They present a crowdsourcing-based annotation scheme to create high-quality datasets of rumourous conversations from social media.
• The rumour annotation scheme is validated through comparison between crowdsourced and reference annotations.
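One way to compare crowdsourced annotations with reference annotations is a chance-corrected agreement score. The slide does not say which measure Zubiaga et al. used, so Cohen's kappa is an assumption here, and the label names and data are hypothetical.

```python
from collections import Counter


def cohen_kappa(crowd, reference):
    """Cohen's kappa between two equally long label sequences."""
    if len(crowd) != len(reference) or not crowd:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(crowd)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(c == r for c, r in zip(crowd, reference)) / n
    # Expected agreement if both annotators labeled at random
    # according to their own label frequencies.
    crowd_counts, ref_counts = Counter(crowd), Counter(reference)
    labels = set(crowd) | set(reference)
    expected = sum(crowd_counts[l] * ref_counts[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both sequences use one and the same label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels for four tweets.
crowd_labels = ["rumour", "rumour", "non-rumour", "non-rumour"]
reference_labels = ["rumour", "non-rumour", "non-rumour", "non-rumour"]
print(cohen_kappa(crowd_labels, reference_labels))  # 0.5
```

Here the raw agreement is 0.75, but half of that is expected by chance given the label frequencies, so kappa is 0.5.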

Previous Crowdsourcing-Based Text Classification
• Adams, B., & McKenzie, G. (2018). Crowdsourcing the character of a place: Character-level convolutional networks for multilingual geographic text classification.
• The authors show that their model works for any language without text preprocessing and is competitive with state-of-the-art word-based models for classification of multilingual geographic noisy text.
• Their model was tested on 4 crowdsourced datasets (Wikipedia articles, Twitter posts, …).
• However, currently word-based methods still require less

Potential research directions
• Crowdsourcing-Based Text Classification for language learning materials
• Possible tasks:
  • Learning of language as a function of age, gender and
  • Identifying types of learning errors
• Any other ideas?
• Research cooperation (STSM or email

Thank you very much