SWPACA, February 2015
Christian F. Hempelmann (ontology@tamuc.edu)
Tools for Quantifying Masculinity in The Hobbit

Prologue
- Lord Kelvin: “when you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meagre and unsatisfactory kind…”
- most computational linguistics/statistics (psychology)/math: non-linguistic, non-literary text-to-number methods
  - don’t do language and literary texts justice (treat it as data)
  - can only tell that two texts are more similar or different
  - can’t tell why: blind
  - not very useful for mixed qualitative/quantitative research
- focus here on methods that do language (at least some) justice

Overview
- introduce tools for text processing and analysis
- comparative focus (because of general blindness to text meaning)
- preliminary data: first, middle, last Hobbit chapters
- hardware and OS
  - Apple MacBook Pro (Retina, Mid 2012)
  - Mac OS X 10.10.2 (Yosemite)
- tools
  - TextSTAT
  - Lexomics
  - UAM Corpus Tool
  - LIWC
  - Coh-Metrix, “spreading activation” semantics

TextSTAT v2.9c
- http://neon.niederlandistik.fu-berlin.de/en/textstat/
- Windows, Linux, Mac OS; free
- corpus gatherer (web or text files)
- concordancer
- word frequency list generator
- not language-specific (works with many encodings)
- search supports regular expressions
- ease-of-use: well, runs inside Python…

TextSTAT v2.9c
- launching

TextSTAT v2.9c
- corpus gathering
- word counts (eyeball “TF/IDF”)
- one-line concordance
- expanded-context concordance
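
The two core outputs listed above, a word frequency list and a one-line (keyword-in-context) concordance, can be approximated in a few lines of Python. This is only a minimal sketch, not TextSTAT's own implementation: the tokenizer, the function names, and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return runs of letters/apostrophes as tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(text):
    """Map each token to the number of times it occurs in the text."""
    return Counter(tokenize(text))


def concordance(text, keyword, width=3):
    """Return one line per hit: up to `width` tokens of context on each side."""
    tokens = tokenize(text)
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


# Invented sample sentence (an assumption; not taken from the corpus).
sample = "The hobbit ran, and the dwarves ran after the hobbit."

for word, count in word_frequencies(sample).most_common(3):
    print(word, count)
for line in concordance(sample, "hobbit", width=2):
    print(line)
```

Raising `width` gives something closer to the expanded-context view; comparing the frequency lists of two chapters side by side is the "eyeball" step the slide refers to.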

UAM Corpus Tool
- http://www.wagsoft.com/CorpusTool/
- Windows, Mac OS; free
- text segmentation
- text annotation
- corpus collection
- puts annotation of whole text or segments of text into “layers” (like XML/RDF markup)
- annotation can be manual/automatic/mixed

UAM Corpus Tool
- files

UAM Corpus Tool
- manual tagging layers

UAM Corpus Tool
- manually annotating: processes

UAM Corpus Tool
- automatic annotation: part of speech

UAM Corpus Tool
- automatic annotation: Stanford tree parse

UAM Corpus Tool
- statistics: numbers for a given tag, here mental processes

UAM Corpus Tool
- statistics: most frequent words in a given tag

UAM Corpus Tool
- search by annotation: processes/mental

Lexomics
- http://lexos.wheatoncollege.edu
- online, Python; free
- geared (but not restricted) to Old English, Latin
- very good text cleaner (“scrubber”)
- text segmenter
- word frequency counter
- visualization
- text similarity dendrograms
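
A similarity dendrogram is built from pairwise distances between text segments. The sketch below shows only that first step, under assumptions that are not taken from the tool: fixed-size word segments and cosine distance over raw word counts. The function names are hypothetical, and the tool's own segmenting and distance settings may differ.

```python
import math
from collections import Counter


def segment(tokens, size):
    """Cut a token list into consecutive segments of `size` tokens (last may be shorter)."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def cosine_distance(a, b):
    """1 - cosine similarity of the word-count vectors of token lists a and b."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty segment as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def distance_matrix(segments):
    """Pairwise cosine distances between all segments."""
    return [[cosine_distance(s, t) for t in segments] for s in segments]


# Invented toy tokens (an assumption), cut into three 4-word segments.
tokens = "bilbo ate cake bilbo ate bread smaug burned town smaug burned lake".split()
for row in distance_matrix(segment(tokens, 4)):
    print([round(d, 2) for d in row])
```

A hierarchical clustering routine would then merge the closest segments step by step to draw the tree. The matrix only says which segments share vocabulary, not why, which is the "blindness" noted in the prologue.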

Lexomics
- file upload

Lexomics
- cleaning texts

Lexomics
- rolling window graph (“hobbit”, 20-word window)
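
The idea behind a rolling window graph can be sketched as follows: slide a fixed-size window over the token stream one word at a time and record how often the search word falls inside it. This is a minimal illustration under assumptions (raw counts per window, hypothetical function name); the tool's own graph may plot averages or ratios instead.

```python
def rolling_counts(tokens, keyword, window=20):
    """Count occurrences of `keyword` in every window tokens[i:i + window]."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(tokens) < window:
        return []  # text is shorter than a single window
    # Count the first window directly, then update incrementally.
    count = sum(1 for t in tokens[:window] if t == keyword)
    counts = [count]
    for i in range(window, len(tokens)):
        if tokens[i] == keyword:           # token entering the window
            count += 1
        if tokens[i - window] == keyword:  # token leaving the window
            count -= 1
        counts.append(count)
    return counts


# Invented toy input (an assumption) with a 3-word window for readability.
toy = ["hobbit", "x", "x", "hobbit", "hobbit"]
print(rolling_counts(toy, "hobbit", window=3))
```

Plotting the returned list against window position gives the curve; peaks mark stretches of text where the word clusters.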

Lexomics
- word cloud (stop words removed*): but don’t confuse visualization with analysis!

*stop word lists for many languages: https://code.google.com/p/stop-words/

Lexomics
- “BubbleViz” (stop words removed): again, don’t confuse visualization with analysis
- also: circos for relational data http://mkweb.bcgsc.ca/tableviewer/

LIWC v3.0
- http://www.liwc.net/ (Linguistic Inquiry and Word Count)
- Mac OS, Windows; $90 (lite $30)
- disciplinary background: psychology (psycholinguistics)
- extensively validated since 2001
- connects keyword lists to psychological variables
- method: keyword lists (“dictionary”) and percentage counts
- automatically segments sentences (after non-sentence-final periods have been manually removed)
- manual segmentation option

LIWC v3.0
- 4500 words in 76 output variables
  - 22 linguistic dimensions (e.g., percentage of words that are pronouns, articles, auxiliary verbs, etc.) [cf. Federalist Papers author identification by prepositions]
  - 32 word categories tapping psychological constructs (e.g., affect, cognition, biological processes)
  - 7 personal concern categories (e.g., work, home, leisure activities)
  - 3 paralinguistic dimensions (assents, fillers, nonfluencies)
  - 12 punctuation categories (periods, commas, etc.)

LIWC v3.0
- potentially relevant categories:
  - positive emotion, e.g.: love, nice, sweet
  - negative emotion, e.g.: hurt, ugly, nasty
  - sadness, e.g.: crying, grief, sad
  - causation, e.g.: effect, because, hence
  - death, e.g.: bury, coffin, kill
- and you can build and validate your own dictionaries…
  - coming soon (Reysen, Reid, Hempelmann)
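
The dictionary-and-percentage method can be illustrated with a toy version. The sketch below is not LIWC and does not use its dictionary: the category table contains only the example words listed above, the function name is hypothetical, and exact whole-word matching is an assumption.

```python
import re

# Toy dictionary built only from the example words on the slide
# (an assumption for illustration; NOT the actual LIWC dictionary).
CATEGORIES = {
    "positive emotion": {"love", "nice", "sweet"},
    "negative emotion": {"hurt", "ugly", "nasty"},
    "sadness": {"crying", "grief", "sad"},
    "causation": {"effect", "because", "hence"},
    "death": {"bury", "coffin", "kill"},
}


def category_percentages(text, categories=CATEGORIES):
    """Percentage of all tokens in `text` that belong to each category."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens)
    if total == 0:
        return {name: 0.0 for name in categories}
    return {
        name: 100.0 * sum(1 for t in tokens if t in words) / total
        for name, words in categories.items()
    }


# Invented ten-word sample (an assumption; not from the corpus).
sample = "It was a nice day but grief and love remained"
for name, pct in category_percentages(sample).items():
    print(f"{name}: {pct:.1f}%")
```

Because a word counts toward a category regardless of its context, the percentages are only as good as the fit between the dictionary and the text.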

LIWC v3.0
- chapter 1
  - is lower in anger [all comparatively]
  - higher in insight
  - higher in visual perception
  - higher in ingestion (eating, drinking)
- chapter 10
  - higher in sadness
  - lower in certainty
  - lower in personal concerns (achievement, leisure, home)
  - lower in religion
- chapter 19
  - higher in positive and lower in negative emotions
  - lower in discrepancy, tentativeness, inhibition, and exclusiveness
  - higher in motion, time, and money

LIWC v3.0
- positive emotion in ch. 19

LIWC v3.0
- another problematic category: sexuality [really not in Hobbit or LOTR]
  - “and comfortable father got something a bit queer in his makeup”
  - “and with the spike on his staff scratched a queer sign on the hobbit’s beautiful green frontdoor”
  - “he said with his hand on his breast”
  - “he loved smokerings”
  - “excitable little fellow said gandalf as they sat down again”
  - “queer fits but he is one of the best”
- dictionaries have to be appropriate for, e.g.,
  - genre
  - period
  - author

Even More Complex Tools
- Coh-Metrix
  - http://cohmetrix.com/
  - measure a text’s cohesion (text in, lots of numbers out)
  - e.g., LSA, Lexical Diversity, Connectives, Syntactic Complexity, Syntactic Pattern Density, Readability
- semantic spreading activation networks
  - http://www.ntent.com
  - ad serving platform (text in, index of concepts in the text out)
  - what senses of words in a sentence support each other
  - the pickup forded the river [serve Chevy ad or not?]
- ontological semantics
  - www.tamuc.edu/ontology
  - processes text in human-like fashion (text in, meaning of text out)
  - non-probabilistic event constraint satisfaction
  - the pickup forded the river [disambiguate, unwrap elliptic event FORD]
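
To give a sense of what one of the "numbers out" can look like, the sketch below computes the type-token ratio, a standard and very simple measure of lexical diversity. It is an illustration only: the function name and tokenizer are assumptions, and Coh-Metrix's own lexical diversity indices may be computed differently.

```python
import re


def type_token_ratio(text):
    """Number of distinct word forms divided by the total number of words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Invented sample (an assumption): 5 tokens, 4 distinct types.
print(type_token_ratio("The hobbit and the dwarf"))
```

The ratio falls as texts get longer, so it is best compared across segments of equal length, such as same-sized excerpts from each chapter.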

Summary
- simple, easy to use, robust, but weak analysis
  vs.
- complex, hard to use, brittle, (expensive), but powerful analysis
- know what hypotheses can be addressed by what tool
- degrees of quantitativity for inter-subjective answers
- but don’t just go for what’s answered easily by a given tool
- don’t be dazzled by numbers
- collaborate multi-disciplinarily
  - literary scholars
  - linguists
  - psychologists
  - geologists
  - …