LightSIDE Tutorial
Carolyn Penstein Rosé
Language Technologies Institute / Human-Computer Interaction Institute


Introduction

What is machine learning?
• Automatically or semi-automatically
o Inducing rules from data
o Making predictions
[Diagram: Data → Learning Algorithm → Model; New Data → Classification Engine → Prediction]

http://lightsidelabs.com/research

Automatic Analysis Of Conversation

Effective data representations make problems learnable…
• Machine learning isn’t magic
• But it can be useful for identifying meaningful patterns in your data when used properly
• Proper use requires insight into your data

SouFLé Framework (Howley et al., 2013)
What properties of discourse are important for learning discussions?
[Diagram labels, built up over successive slides: Person; Transactive Knowledge Integration; Engagement; Authority]

Definition of Transactivity
• building on an idea expressed earlier in a conversation
• using a reasoning statement
Example:
“I think the tube will get heavier because water is going in.”
“That’s true, but the important point is that water can flow in, but starch can’t flow out.”

Transactivity (Berkowitz & Gibbs, 1983)
• Findings
o Moderating effect on learning (Joshi & Rosé, 2007; Russell, 2005; Kruger & Tomasello, 1986; Teasley, 1995)
o Moderating effect on knowledge sharing in working groups (Gweon et al., 2011)
• Computational Work
o Can be automatically detected in:
§ Threaded group discussions (Kappa .69) (Rosé et al., 2008)
§ Transcribed classroom discussions (Kappa .69) (Ai et al., 2010)
§ Speech from dyadic discussions (R = .37) (Gweon et al., 2012)
o Predictable from a measure of speech style accommodation computed by an unsupervised Dynamic Bayesian Network (Jain et al., 2012)

Identifying Transactivity in Threaded Discussions
AUTHOR: Hans
Michael blames his poor achievements on a lack of giftedness in mathematics.
---------------
From this one can conclude that his attribution is internal and stable. Internal because it comes from within himself. And stable because it is something that can't be changed.
AUTHOR: Gerry
>Michael blames his poor achievements on a lack of giftedness in mathematics. From…
------------
Wow, that was a really good work. Right on!
------------
From the case I could not however directly conclude that Michael thinks the task is too difficult for him. Instead I thought Michael thinks that he is too dumb for mathematics.
-------------
Therefore, I did not include something about that in my contribution.
• Social modes of co-construction (Weinberger & Fischer, 2006)
o To what degree or in what ways learners refer to the contributions of their learning partners
• TagHelper tools achieve reliability of .69 Kappa (Rosé et al., 2008)

Thread Structure Features
(Example thread: the Hans and Gerry messages from the previous slide.)
• Thread structure features
o depth (numeric): the depth in the thread where a message appears
o parent_child_similarity (numeric): semantic similarity (cosine similarity) between the current message segment and each of its parent message segments; the highest value is chosen
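The parent_child_similarity feature can be sketched roughly as follows. This is only an illustration under stated assumptions: segments are compared as lowercase bag-of-words count vectors split on whitespace, and the function names are hypothetical rather than taken from TagHelper or LightSIDE.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between two texts using raw word counts (assumed representation)."""
    vec_a = Counter(text_a.lower().split())
    vec_b = Counter(text_b.lower().split())
    dot = sum(vec_a[term] * vec_b[term] for term in vec_a)
    norm = math.sqrt(sum(c * c for c in vec_a.values())) * math.sqrt(
        sum(c * c for c in vec_b.values())
    )
    return dot / norm if norm else 0.0


def parent_child_similarity(segment, parent_segments):
    """Highest cosine similarity between a segment and any of its parent segments."""
    return max((cosine_similarity(segment, p) for p in parent_segments), default=0.0)
```

A message with no parents (depth 0 in this sketch) gets a similarity of 0.0, which is a design choice made here, not something the slide specifies.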

Evaluating Context-Based Features

Effective data representations make problems learnable…
Remember! Know your data!!

Essential Reading
• Witten, I. H., Frank, E., Hall, M. (2011). Data Mining: Practical Machine Learning Tools and Techniques, third edition, Elsevier: San Francisco

Automated Discourse Analysis
• Howley, I., Mayfield, E. & Rosé, C. P. (2013). Linguistic Analysis Methods for Studying Small Groups, in Cindy Hmelo-Silver, Angela O’Donnell, Carol Chan, & Clark Chinn (Eds.), International Handbook of Collaborative Learning, Taylor and Francis, Inc.
• Rosé, C. P., Wang, Y. C., Cui, Y., Arguello, J., Stegmann, K., Weinberger, A., Fischer, F. (2008). Analyzing Collaborative Learning Processes Automatically: Exploiting the Advances of Computational Linguistics in Computer-Supported Collaborative Learning, International Journal of Computer Supported Collaborative Learning 3(3), pp. 237-271.
• Mu, J., Stegmann, K., Mayfield, E., Rosé, C. P., Fischer, F. (2012). The ACODEA Framework: Developing Segmentation and Classification Schemes for Fully Automatic Analysis of Online Discussions. International Journal of Computer Supported Collaborative Learning 7(2), pp. 285-305.
• Gweon, G., Jain, M., McDonough, J., Raj, B., Rosé, C. P. (2013). Measuring Prevalence of Other-Oriented Transactive Contributions Using an Automated Measure of Speech Style Accommodation, International Journal of Computer Supported Collaborative Learning 8(2), pp. 245-265.

Applications to Learning Sciences Research
• Howley, I., Kumar, R., Mayfield, E., Dyke, G., & Rosé, C. P. (2013). Gaining Insights from Sociolinguistic Style Analysis for Redesign of Conversational Agent Based Support for Collaborative Learning, in Suthers, D., Lund, K., Rosé, C. P., Teplovs, C., Law, N. (Eds.). Productive Multivocality in the Analysis of Group Interactions, edited volume, Springer.
• Howley, I., Mayfield, E., Rosé, C. P., & Strijbos, J. W. (2013). A Multivocal Process Analysis of Social Positioning in Study Group Interactions, in Suthers, D., Lund, K., Rosé, C. P., Teplovs, C., Law, N. (Eds.). Productive Multivocality in the Analysis of Group Interactions, edited volume, Springer.
• Adamson, D., Dyke, G., Jang, H. J., Rosé, C. P. (2014). Towards an Agile Approach to Adapting Dynamic Collaboration Support to Student Needs, International Journal of AI in Education 24(1), pp. 91-121.

Text Teaser

Consider this simple example… Look for what distinguishes Questions and Statements in this dataset. What clues do you see?

What are good features for text categorization?
What distinguishes Questions and Statements?
• Not all questions end in a question mark.
• “I” versus “you” is not a reliable predictor.
• Not all WH words occur in questions.

LightSIDE: A quick tour

Basic Text Feature Extraction

Represent text as a vector where each position corresponds to a term
This is called the “bag of words” approach
Vector positions: Cheese, Cows, Eat, Hamsters, Make, Seeds
• Cows make cheese. → 110010
• Hamsters eat seeds. → 001101
But same representation for “Cheese makes cows.”!
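A minimal sketch of the bag-of-words vectors on this slide. The tokenizer is an assumption made for illustration: it lowercases, keeps alphabetic tokens, and strips a trailing "s" as a very crude stand-in for stemming, so that "makes" and "make" share a position. It is not how LightSIDE tokenizes text.

```python
import re


def tokenize(text):
    """Lowercase alphabetic tokens with a crude trailing-'s' strip (illustrative only)."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w[:-1] if w.endswith("s") else w for w in words]


def build_vocabulary(docs):
    """One vector position per distinct term, in alphabetical order."""
    return sorted({term for doc in docs for term in tokenize(doc)})


def to_vector(text, vocab):
    """Binary presence vector written as a string of 0s and 1s."""
    present = set(tokenize(text))
    return "".join("1" if term in present else "0" for term in vocab)


docs = ["Cows make cheese.", "Hamsters eat seeds."]
vocab = build_vocabulary(docs)  # cheese, cow, eat, hamster, make, seed
print(to_vector("Cows make cheese.", vocab))
print(to_vector("Cheese makes cows.", vocab))
```

Both printed vectors are identical, which is the point of the slide: word order is lost in this representation.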

Examples from Gallup Poll Data
• Male from Virginia, age 30, negative: “I think it’ll increase costs for everyone.”
• Female from Illinois, unknown age, positive: “Because the cost of healthcare is just outta sight crazy”
• Male from Michigan, age 70, positive: “the cost”

The Gallup Poll Dataset

Basic Types of Features “Because the cost of healthcare is just outta sight crazy”

Basic Types of Features “the cost of healthcare” DT NN PRP NN

Part of Speech Tagging
http://www.comp.leeds.ac.uk/ccalas/tagsets/upenn.html
1. CC Coordinating conjunction
2. CD Cardinal number
3. DT Determiner
4. EX Existential there
5. FW Foreign word
6. IN Preposition/subordinating conjunction
7. JJ Adjective
8. JJR Adjective, comparative
9. JJS Adjective, superlative
10. LS List item marker
11. MD Modal
12. NN Noun, singular or mass
13. NNS Noun, plural
14. NNP Proper noun, singular
15. NNPS Proper noun, plural
16. PDT Predeterminer
17. POS Possessive ending
18. PRP Personal pronoun
19. PP$ Possessive pronoun
20. RB Adverb
21. RBR Adverb, comparative
22. RBS Adverb, superlative
23. RP Particle
24. SYM Symbol
25. TO to
26. UH Interjection
27. VB Verb, base form
28. VBD Verb, past tense
29. VBG Verb, gerund/present participle
30. VBN Verb, past participle
31. VBP Verb, non-3rd person singular present
32. VBZ Verb, 3rd person singular present
33. WDT wh-determiner
34. WP wh-pronoun
35. WP$ Possessive wh-pronoun
36. WRB wh-adverb

Basic Types of Features “the cost of healthcare” 4

Basic Types of Features “the cost of healthcare” YES

Basic Types of Features “the cost is too great. The cost is immense!” The value of the feature is the number of times it occurs, rather than 1 if it occurs or 0 otherwise, which is the default.
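The difference between the default binary value and the count option can be sketched as below. The function name and the simple alphabetic tokenizer are assumptions for illustration, not LightSIDE's implementation.

```python
import re
from collections import Counter


def unigram_features(text, use_counts=False):
    """Map each word to 1 if present (default) or to its number of occurrences."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    if use_counts:
        return dict(counts)
    return {term: 1 for term in counts}


example = "the cost is too great. The cost is immense!"
binary = unigram_features(example)
counted = unigram_features(example, use_counts=True)
```

With counts enabled, "cost" has the value 2 for this example; with the default setting it has the value 1.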

Basic Types of Features “the cost is too great. The cost is immense!” If you uncheck this, punctuation will be ignored and stripped out of the representation.

Basic Types of Features X X “the cost of healthcare”

Basic Types of Features “healthcare costs” “healthcare cost”

Clarification on Basic text feature extractor
• POS tagging happens before stemming or stopword removal
• POS bigrams are not affected by stopword removal: POS tags for stopwords will still be included
• On word n-grams, the only n-grams that will be dropped in the case of stopword removal are ones that consist only of stopwords
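The stopword rule for word n-grams described above could look roughly like this. The stopword list is a tiny hypothetical one and the helper names are invented for the sketch.

```python
STOPWORDS = {"the", "of", "is", "a"}  # hypothetical mini stopword list


def word_ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def drop_all_stopword_ngrams(ngrams, stopwords=STOPWORDS):
    """Drop only the n-grams in which every word is a stopword."""
    return [gram for gram in ngrams if not all(word in stopwords for word in gram)]


tokens = "the cost of healthcare".split()
kept_unigrams = drop_all_stopword_ngrams(word_ngrams(tokens, 1))
kept_bigrams = drop_all_stopword_ngrams(word_ngrams(tokens, 2))
```

Here the unigrams "the" and "of" are dropped, but every bigram survives because each one contains at least one non-stopword.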

Feature Space Customizations
• Feature Space Design
o Think like a computer!
o Machine learning algorithms look for features that are good predictors, not features that are necessarily meaningful
o Look for approximations
§ If you want to find questions, you don’t need to do a complete syntactic analysis
§ Look for question marks
§ Look for wh-terms that occur immediately before an auxiliary verb
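The two approximations for finding questions can be sketched with regular expressions. The wh-term and auxiliary lists below are short, incomplete assumptions chosen for illustration.

```python
import re

# Assumed, incomplete word lists for the sketch.
WH_TERMS = r"(?:who|what|when|where|why|which|how)"
AUXILIARIES = r"(?:is|are|was|were|do|does|did|can|could|will|would|should|has|have|had)"
WH_BEFORE_AUX = re.compile(rf"\b{WH_TERMS}\s+{AUXILIARIES}\b", re.IGNORECASE)


def question_features(text):
    """Two cheap surface features that approximate 'this is a question'."""
    return {
        "has_question_mark": "?" in text,
        "wh_before_aux": bool(WH_BEFORE_AUX.search(text)),
    }
```

For instance, "What is the answer" triggers the wh-before-auxiliary feature even without a question mark, while "I know what you did." triggers neither.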

Effective Development and Evaluation Process in LightSIDE

If Outlook = sunny, no
else if Outlook = overcast, yes
else if Outlook = rainy and Windy = TRUE, no
else yes
Perfect on training data

If Outlook = sunny, no
else if Outlook = overcast, yes
else if Outlook = rainy and Windy = TRUE, no
else yes
Not perfect on training data
Performance on testing data?

If Outlook = sunny, no
else if Outlook = overcast, yes
else if Outlook = rainy and Windy = TRUE, no
else yes
IMPORTANT! If you evaluate the performance of your rule on the same data you trained on, you won’t get an accurate estimate of how well it will do on new data.
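The rule above can be written as a small function and scored on two sets of rows. The rows below are invented purely for illustration (they are not the dataset from the slides); they are constructed so the rule fits the training rows perfectly but misses one held-out row.

```python
def predict_play(outlook, windy):
    """The rule from the slide, as a function."""
    if outlook == "sunny":
        return "no"
    if outlook == "overcast":
        return "yes"
    if outlook == "rainy" and windy:
        return "no"
    return "yes"


def accuracy(rows):
    """Fraction of (outlook, windy, label) rows the rule gets right."""
    correct = sum(predict_play(outlook, windy) == label for outlook, windy, label in rows)
    return correct / len(rows)


# Hypothetical rows, made up for this sketch.
train_rows = [
    ("sunny", False, "no"),
    ("overcast", True, "yes"),
    ("rainy", True, "no"),
    ("rainy", False, "yes"),
]
test_rows = [
    ("sunny", False, "yes"),  # the rule says "no" here
    ("overcast", False, "yes"),
    ("rainy", True, "no"),
    ("sunny", True, "no"),
]
print(accuracy(train_rows), accuracy(test_rows))
```

The gap between the two numbers is what evaluating on the training data would hide.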

Simple Cross Validation
Let’s say your data has attributes A, B, and C, and you want to train a rule to predict D. The data is divided into 7 segments; in each fold, one segment is the TEST set and the other six are the TRAIN set.
• Fold 1: train on 2, 3, 4, 5, 6, 7 and apply the trained model to 1. The result is Accuracy 1.
• Fold 2: train on 1, 3, 4, 5, 6, 7 and apply the trained model to 2. The result is Accuracy 2.
• Fold 3: train on 1, 2, 4, 5, 6, 7 and apply the trained model to 3. The result is Accuracy 3.
• Fold 4: train on 1, 2, 3, 5, 6, 7 and apply the trained model to 4. The result is Accuracy 4.
• Fold 5: train on 1, 2, 3, 4, 6, 7 and apply the trained model to 5. The result is Accuracy 5.
• Fold 6: train on 1, 2, 3, 4, 5, 7 and apply the trained model to 6. The result is Accuracy 6.
• Fold 7: train on 1, 2, 3, 4, 5, 6 and apply the trained model to 7. The result is Accuracy 7.
Finally: Average Accuracy 1 through Accuracy 7.
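The seven-fold procedure can be sketched in a few lines. The "model" here is a majority-class baseline, chosen only to keep the example self-contained; all function names are hypothetical, and the sketch assumes there are at least as many rows as folds.

```python
from collections import Counter


def k_fold_splits(n_items, k):
    """Yield (train_indices, test_indices) for k contiguous segments."""
    bounds = [n_items * i // k for i in range(k + 1)]
    for fold in range(k):
        test_idx = list(range(bounds[fold], bounds[fold + 1]))
        train_idx = list(range(0, bounds[fold])) + list(range(bounds[fold + 1], n_items))
        yield train_idx, test_idx


def train_majority(rows):
    """Stand-in learner: always predict the most frequent training label."""
    return Counter(label for _, label in rows).most_common(1)[0][0]


def accuracy(model, rows):
    """Fraction of rows whose label equals the model's single prediction."""
    return sum(label == model for _, label in rows) / len(rows)


def cross_validate(rows, k=7):
    """Return (average accuracy, per-fold accuracies)."""
    scores = []
    for train_idx, test_idx in k_fold_splits(len(rows), k):
        model = train_majority([rows[i] for i in train_idx])
        scores.append(accuracy(model, [rows[i] for i in test_idx]))
    return sum(scores) / len(scores), scores
```

Each row is only ever scored by a model that did not see it during training, and the final figure is the average of Accuracy 1 through Accuracy 7.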

Avoiding Overfitting!
• Separate data for evaluation from data for exploration
• We will refer to the exploration set as the Dev Set
• We will refer to the evaluation set as the cross-validation set
• You should also have a final test set you never look at until you think you are done!

Remember!!!!
• Use your development data for:
o Qualitative analysis before ML
o Error analysis
o Ideas for design of new features
• Use your cross validation data for:
o Evaluating your performance
• Never include the data you are testing on in the data you do feature selection with!!!
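One way to set up the three disjoint sets named above is a single shuffled split. The 20/60/20 proportions and the fixed seed are assumptions for this sketch; the slides do not prescribe sizes.

```python
import random


def three_way_split(rows, dev_frac=0.2, test_frac=0.2, seed=0):
    """Split rows into (dev, cross_validation, final_test) with no overlap."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_dev = int(len(shuffled) * dev_frac)
    n_test = int(len(shuffled) * test_frac)
    dev = shuffled[:n_dev]
    final_test = shuffled[n_dev:n_dev + n_test]
    cross_validation = shuffled[n_dev + n_test:]
    return dev, cross_validation, final_test


dev, cv, final_test = three_way_split(range(100))
```

Feature ideas and error analysis would then draw only on `dev`, performance estimates only on `cv`, and `final_test` would stay untouched until the end.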

Evaluation

Why is performance different?
• Men and women used language differently
• Different focus
o Women had a more personal focus
o Men had a more national/objective focus

Special Text Features

Stretchy Patterns in LightSIDE
Looking at sentiment_sentences.csv

Configuring Stretchy Patterns
• Longer patterns and longer gaps lead to larger numbers of features
• Categories are useful both for abstraction and for anchoring the patterns
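To see why longer patterns and longer gaps inflate the feature count, here is a simplified gap-pattern enumerator. It is an assumption-laden illustration, not LightSIDE's stretchy pattern extractor: it ignores categories, and any skipped run of 1 to `max_gap` words is written as a single hypothetical `[GAP]` marker.

```python
def gap_patterns(tokens, max_len=3, max_gap=1):
    """Enumerate patterns of 2..max_len words, optionally skipping up to max_gap words between them."""
    patterns = set()

    def extend(pos, pattern, n_words):
        if n_words >= 2:
            patterns.add(" ".join(pattern))
        if n_words == max_len:
            return
        # Continue with the immediately following word.
        if pos + 1 < len(tokens):
            extend(pos + 1, pattern + [tokens[pos + 1]], n_words + 1)
        # Or skip 1..max_gap words and record the skip as a gap marker.
        for gap in range(1, max_gap + 1):
            nxt = pos + 1 + gap
            if nxt < len(tokens):
                extend(nxt, pattern + ["[GAP]", tokens[nxt]], n_words + 1)

    for start in range(len(tokens)):
        extend(start, [tokens[start]], 1)
    return patterns


tokens = "the cost is too great".split()
plain_bigrams = gap_patterns(tokens, max_len=2, max_gap=0)
with_gaps = gap_patterns(tokens, max_len=2, max_gap=1)
longer = gap_patterns(tokens, max_len=3, max_gap=1)
```

Even on this five-word sentence the pattern set grows at each step, which is the effect the slide warns about.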

Regular Expressions

American Street Gangs
Predict gang affiliation from posts
• Crips, Bloods, Hoovers
o crips started in South Central LA
o Pirus, Bloods, Hoovers from crips
• Chicago based
o People Nation
§ vice lords, latin kings, stones
o Folk nation
§ gangster disciples
• Trinitarios
o hispanic gang based in NYC

Graffiti Based Style Features
Graffiti
• Social messages
• Stylistic writing
• Crossing out other gangs
On the board (style feature: examples)
ck: ckrab, ckome
cc: fucc, blocc
pk: pkut, ...
hk: whky, hkappens
bk: bk1, bkang
3: 3ast
5: 5hit
c^: c^rime, c^uh

Character N-grams
• Character bigrams can detect graffiti style features
• Could also be used to identify consistent endings on words (i.e., that indicate formality or gender)
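Character n-gram extraction is short enough to sketch directly. The function name is hypothetical, and the sketch counts every contiguous character window, including ones that span a space.

```python
from collections import Counter


def char_ngrams(text, n=2):
    """Count every contiguous run of n characters in the text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


bigrams = char_ngrams("ckrab ckome")
```

Here the stylistic marker "ck" is counted twice, even though neither word would match an ordinary dictionary unigram.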

Parse Features
• Word based features lose all structure and order within sentences
• Parse features can capture that
• But they are SLOW!!

Error Analysis

Error Analysis Process: High Level Overview
• Identify large error cells
• Make comparisons
o Ask yourself how it is similar to the instances that were correctly classified with the same class (vertical comparison)
o How it is different from those it was incorrectly not classified as (horizontal comparison)
Goal: We want to discover how to re-represent the data so that instances with the same class value look more similar to one another and instances with different class values look more different

* Testing bigrams as an alternative….

Heterogeneous Datasets

Datasets
Three datasets for age prediction:
• Blogs from blogger.com (targeted crawl by Schler et al., 2006; 9,600 training docs @ 13K tokens)
• Fisher corpus of telephone conversation transcripts (Cieri et al., 2004; 5,957 training docs @ 3K tokens)
• Online forum for breast cancer patients, breastcancer.org (2,330 training docs @ 23K tokens)
[Figure: Age distributions in datasets (x-axis: age; y-axis: frequency)]
Datasets divided into training, development and test set

Feature Splitting (Daumé III, 2007)
[Diagram labels: General, Domain A, Domain B]
Why is this nonlinear? It represents the interaction between each feature and the Domain variable.
Now that the feature space represents the nonlinearity, the algorithm to train the weights can be linear.
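A rough sketch of the feature splitting idea: every feature gets a general copy plus a copy tagged with the instance's domain, so a linear learner can assign a shared weight and a domain-specific adjustment. The `general:` and domain prefixes, the function names, and the weights below are all hypothetical.

```python
def split_features(features, domain):
    """Duplicate each feature into a general copy and a domain-specific copy."""
    augmented = {}
    for name, value in features.items():
        augmented[f"general:{name}"] = value
        augmented[f"{domain}:{name}"] = value
    return augmented


def linear_score(features, weights):
    """Plain weighted sum; unknown features contribute nothing."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


# Made-up weights: "lol" leans positive overall, more so in domain A, less so in B.
weights = {"general:lol": 0.5, "A:lol": 1.0, "B:lol": -1.0}
score_a = linear_score(split_features({"lol": 1}, "A"), weights)
score_b = linear_score(split_features({"lol": 1}, "B"), weights)
```

The same surface feature ends up with a different effective weight in each domain, yet the scoring function itself stays linear.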

Leveraging Subpopulations through Multi-Level Modeling

Gang Alliances

Gangs Data

Feature Analysis
• Style features that distinguish Allied from Opposing differ by dominant gang
• Crips:
o Allied: bCaret
o Opposing: CC, PK, cCaret
• Bloods:
o Allied: XO, CC
o Opposing: hCaret, BK
• Latin Kings:
o Allied: CC, XO
o Opposing: 5S
When the dominant gang is in an allied thread, we see style features that unite them against opposing gangs.
When the dominant gang is in an opposing thread, we also see features that unite the opposing gangs against them.

Feature Analysis
• Unigram features that distinguish Allied from Opposing don’t differ by dominant gang as much as style features
• Universal:
o Allied: lmao, you, crew
o Opposing: forever, wtf, where
• Crips:
o Allied: lol
o Opposing: know, about
• Bloods:
o Allied: niggas, the
o Opposing: at
We see relationship words, but not gang identity words.

Subpopulations and Overfitting Example from gender prediction in blog data…

What is different in how men and women talk?

Confounded with other variables
• Men sound older and women sound younger (Argamon et al., 2007)
• Men sound more like non-fiction and women sound more like fiction (Argamon et al., 2003)

Why do low level features overfit?
• In a linear model, positive weights push the decision towards one class while negative weights push the decision towards the other class
• The magnitude of the weight indicates how much of a push that feature gives
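The push-and-pull of weights in a linear model can be made concrete with a toy classifier. The feature names, weights, and class labels below are invented for the sketch.

```python
def linear_score(features, weights, bias=0.0):
    """Sum of weight times feature value; the sign decides the class."""
    return bias + sum(weights.get(name, 0.0) * value for name, value in features.items())


def classify(features, weights, positive="class A", negative="class B"):
    """Positive total pushes towards one class, negative towards the other."""
    return positive if linear_score(features, weights) > 0 else negative


# Hypothetical weights: "lol" pushes towards class A, "budget" pushes harder towards class B.
weights = {"lol": 1.5, "budget": -2.0}
mixed = {"lol": 1, "budget": 1}
```

When both features fire, the larger-magnitude weight wins, so `mixed` scores -0.5 and lands in class B.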

Why do low level features overfit?
• What happens if the same feature predicts age, gender, and social class?
o If you are predicting gender, then the average value for each feature assumes the mix of age and social class in the data set you trained on
§ The weights normalize for this mix
§ If the mix changes, then the normalization will be wrong
o So the weights won’t predict gender correctly anymore on datasets where the mix of those other factors is different

Never saw MOH in train, so trained model will overpredict extent of swearing among males on test set
Train: MYL FYH MYL MOL MYL FYH FYH FOH MYH FOH MYL MOL FYL MOH FYL FOH MOH FOL MOL FOL MYH MOH FYL FOH FOL FYL MOH FOL FYL MYH MOL MYH FYH MOL MYH FOH FYH MOH FOH MOL MYH MYL MOL MOH FYL MOL MYH
Test: MYH FYL MOL MYH MOH

Evaluation of Domain Generality
• Contrast random CV and leave-one-occupation-out CV
• All feature space representations show significant drop between random CV and leave-one-occupation-out CV
• Only stretchy patterns remain significantly above random performance
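Leave-one-occupation-out cross-validation differs from random cross-validation only in how the folds are formed: each fold holds out every document from one occupation. A minimal sketch, with a hypothetical row layout and invented example rows:

```python
def leave_one_group_out(rows, group_of):
    """Yield (held_out_group, train_rows, test_rows), one fold per distinct group."""
    groups = sorted({group_of(row) for row in rows})
    for held_out in groups:
        train = [row for row in rows if group_of(row) != held_out]
        test = [row for row in rows if group_of(row) == held_out]
        yield held_out, train, test


# Invented rows: (occupation, text, label).
rows = [
    ("teacher", "grading all night", "F"),
    ("teacher", "lesson plans done", "M"),
    ("engineer", "shipped the build", "M"),
    ("lawyer", "filed the brief", "F"),
]
folds = list(leave_one_group_out(rows, group_of=lambda row: row[0]))
```

Because the model never sees the held-out occupation during training, a drop relative to random cross-validation indicates features that do not generalize across subpopulations.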