
Complex phenomena deserve complex explanations – choosing how to think in Finnish
Antti Arppe, University of Helsinki
QITL 2, Osnabrück, 2.6.2006

Overview
• Introduction & Background aka ‘Theory’
• Goals, Corpora, & ‘Quantitative’ Methods
• Results
• Discussion
• Conclusions

Theory and concepts
• contextuality of usage and meaning: “You shall know a word by the company that it keeps!” (J. R. Firth)
  – words are in a structural and semantic relationship with others in their context
  – the choice (i.e. usage) and meaning of words are interconnected with their context
  – in a language with free word order such as Finnish, (functional) dependency grammar (Tesnière) is a practical way to explore such structural relationships
• non-modularity of language – constructionality
  – regularities in co-occurrence and structure can be observed on a continuum of levels, from individual words and synonym groups to general semantic groupings or parts of speech -> Construction Grammar
• synonymy
  – some word pairs or groups have relatively similar meanings
  – in some contexts such words can be interchanged with each other without an essential change in the meaning of the entire utterance

Introduction – Modelling of lexical choice in computational theory
• In the case of semantically similar words, especially near-synonyms, at least three levels have been suggested (Edmonds and Hirst 2002)
  1) conceptual-semantic level
  2) subconceptual/stylistic-semantic level, and
  3) syntactic-semantic level

Factors influencing lexical choice on the syntactic-semantic level
• (mainly) lexicographically motivated corpus-based studies show differences in the use of semantically similar words, i.e. synonyms, in e.g. their:
  1) lexical context
     • e.g. English powerful vs. strong in Biber et al. 1998
  2) syntactic structures of which they form a part
     • e.g. English begin vs. start in Biber et al. 1998
  3) semantic classification of some particular argument
     • e.g. English shake verbs in Atkins & Levin 1996
  4) style-associated text type in which they are used
     • e.g. Biber 1998
• while the above studies have focused on English, with minimal morphology, similar differentiation has also been shown in languages with extensive morphology such as Finnish
  5) with respect to the inflectional forms and the associated morphosyntactic features in which synonyms are used
     • Finnish miettiä and pohtia ‘think, ponder, reflect, consider’ in Arppe and Arppe & Järvikivi 2002
     • tärkeä vs. keskeinen ‘important, central’ in Jantunen 2002

Critical assessment of these results – monocausality
• The mentioned studies are typically monofactorial/monocausal, focusing on one linguistic category or one feature within a category (at a time)
  – HOWEVER, Jantunen (2002) does go through a wide range of categories, but does not quantitatively evaluate their interactions
  – Gries (2003) has argued convincingly, and with justification, for a holistic approach using multifactorial (i.e. multivariate) statistical methods
• HOWEVER, these multivariate methods build upon univariate and bivariate analysis

Critical assessment – dichotomous setups
• The mentioned studies typically concern synonym pairs instead of groups with more than two members
  – powerful vs. strong, start vs. begin, miettiä vs. pohtia, tärkeä vs. keskeinen
  – BUT ALSO Gries’ own study of particle placement concerns a dichotomous choice between two alternative constructions
  – this has also been noted earlier by Divjak and Gries (forthcoming), motivating their exceptional study of nine Russian verbs meaning ‘try’
• However, lexicographical reality, clearly evident both in dictionaries and in language use, often indicates that there are more than just two members in a synonym group
  – THOUGH full interchangeability for more than two synonyms may prima facie be rarer, there are probably at least some contexts where any member of a group of more than two synonyms could be substituted for another without a major reservation
  – Consequently, the differences observed between some pair might change, diminish or even disappear when studied within the entire group

Subsequent goals, methods and corpora in this study
• Explore and develop corpus-based and statistical (quantitative) methodology with an aim to:
  – Extend from dichotomous to polytomous (more than two) setups
     • Inclusion of other members of the THINK synonym groups, with roughly similar magnitudes of frequency (common translations in boldface):
        – ajatella: 1 intend 2 plan 3 imagine, fancy, conceive (conceive of sth) 4 ponder 5 reflect 6 think, think of, give a thought to, figure 7 consider 8 take from some perspective 9 regard, make of (sth)
        – miettiä: 1 think 2 meditate, ponder (meditate on sth) 3 reflect 4 contemplate, conceive (conceive [of] sth), consider, mull [over], wonder (wonder about sth), give a thought to, muse, cast about for 5 think twice, thoroughly
        – pohtia: 1 deliberate, consider, ponder, think over 2 contemplate, discuss (discuss sth), debate, talk over, puzzle, think in terms of 3 wonder (wonder about sth) 4 turn over, chew over 5 kick around / about 6 (think out loud) talk about
        – harkita: 1 contemplate 2 ponder, deliberate, think over 3 weigh, weigh up 4 consider, think of, think in terms of 5 think, entertain 6 think out 7 be considering [doing sth]

Goals (cont’d) …
  – Extend from (simple) monofactorial to (complex) multifactorial models of explanation of lexical choice
     • Inclusion of all practically available linguistic and extra-linguistic contextual information
        – Morphological features and inflectional structure
        – Syntactic arguments (according to dependency grammar as implemented in the Functional Dependency Parser for Finnish by Connexor, influenced by Tesnière 195X)
        – Semantic classifications of syntactic arguments (according to WordNet in the case of nominal lexemes and loosely adapting the semantic primitives of Wierzbicka in the case of non-nominal adverbs)
     • Building upon Gries’ framework (2003) of combining various statistical methods
        – χ² test, Cramér's V, lambda (Goodman-Kruskal), correlation and uncertainty coefficient (UC, Theil) for discovering significant individual features
        – Regression analysis for studying the simultaneous influence and interaction of significant features
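The univariate step named above (χ² test and Cramér's V) can be sketched for a single contextual feature as follows. This is only an illustration in plain Python: the function names and the counts are hypothetical and are not taken from the study, which does not say what software was used.

```python
from math import sqrt

def chi_squared(table):
    """Pearson's chi-squared statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    x2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            x2 += (observed - expected) ** 2 / expected
    return x2

def cramers_v(table):
    """Cramér's V: chi-squared scaled to the range 0..1 as an effect size."""
    n = sum(sum(row) for row in table)
    k = min(len(table), len(table[0])) - 1
    return sqrt(chi_squared(table) / (n * k))

# Hypothetical counts (NOT from the study): rows are feature present/absent,
# columns are ajatella, miettiä, pohtia, harkita.
counts = [[30, 10, 5, 5],
          [20, 40, 45, 45]]
print(chi_squared(counts), cramers_v(counts))
```

A feature whose distribution across the four verbs is far from proportional yields a large χ² value, and Cramér's V expresses the strength of that association independently of sample size.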

Goals (cont’d)
• Extend from traditional written corpora such as newspapers or published literature (formal, standardized and monologic in nature) to more informal material with a dialogic character
  – In addition to two months of Helsingin Sanomat, Finland’s largest daily newspaper, from January-February 1995
     • 3.3 M words with 1750 instances of the studied verbs
  – Inclusion of six months of Finnish Internet discussion group material from 2002–2003
     • sfnet.keskustelu.ihmissuhteet (human relationships) and sfnet.keskustelu.politiikka (politics)
     • 400 K words with 1654 instances of the studied verbs
  – Newspaper/Newsgroup section, author and quotation/body information available from both sources as extralinguistic context
  – In addition, various aspects of repetition were also included as extralinguistic context (first use within article/posting, repetition of the preceding verb, individual preceding verbs of the same group)

Current descriptions of Finnish THINK verbs
• Pajunen: Verbien argumenttirakenne ‘Argument Structure of [Finnish] Verbs’ (2001: 62–63)
  – “Primary-B verbs [i.e. mental verbs], with the exception of speech verbs and some descriptive perception verbs, in general have a flat [classificatory] structure. … In classes with very flat structure these relationships [hyponym-hypernym] are rare and classificatory structure consists of minor sets which are in loose co-hyponymic relationships to each other (i.e. contrast groups)”

Current descriptions...
• Pajunen (2001: 313–319)
  – [käsittää], ajatella:
     • x-arg: subject A: ab:
     • y-arg: object, clause argument=subordinate clause, participle, infinitive
     • Agentivity: volitional participation in state or event, sensing and/or perceiving
  – harkita
     • x-arg: subject: A: (a)b:
     • y-arg: object, clause argument
     • Agentivity: (volitional participation in state or event), sensing and/or perceiving

Current descriptions... – Arguments
• x-arg:
  – human referent
• y-arg:
  – abstract notion > concrete object, state-of-affairs, human referent
• z-arg:
  – goal: stimulus (“world-to-mind”),
  – result (“mind-to-world”)

Current descriptions (Perussanakirja ‘Standard dictionary of Finnish’)
• ajatella 67*C (other members of the THINK group marked in boldface)
  – 1. yhdistää käsitteitä ja mielteitä tietoisesti toisiinsa (us. jnk ongelman ratkaisemiseksi), miettiä, harkita, pohtia, tuumia, järkeillä, päätellä, aprikoida, punnita. Ajatella loogisesti, selkeästi. Lupasi ajatella asiaa. Olen ajatellut sinua. Ajatella jkta pahalla. En tullut sitä ajatelleeksi. Tapaus antoi ajattelemisen [= vakavan harkinnan] aihetta. Ajatella ääneen puhua itsekseen.
  – 2. asennoitua, suhtautua, olla jtak mieltä jstak, arvella. Samoin, toisin ajattelevat. Porvarillisesti ajattelevat kansalaiset. Mitä ajattelet asiasta? Ajattelin, että olisi parasta luopua hankkeesta.
  – 3. kuvitella, olettaa, pitää mahdollisena, otaksua. Suoran ajateltu jatke. Tauti, jonka aiheuttajaksi on ajateltu virusta. Ajatellaanpa, että - -. Paras ajateltavissa oleva. Pahinta, mitä ajatella saattaa.
  – 4. kiinnittää huomiota jhk, ottaa jtak huomioon, pitää jtak silmällä, mielessä. Ajatella omaa etuaan, toisten parasta. Toimia seurauksia ajattelematta. Paras vaihtoehto tulevaisuutta ajatellen paremmin: tulevaisuuden kannalta.
  – 5. harkita, aikoa, suunnitella, tuumia. Ajatteli jäädä eläkkeelle, eläkkeelle jäämistä. Tehtaan paikaksi on ajateltu Torniota.
  – 6. vars. ark. huudahduksissa huomiota kiinnittämässä t. sanontaa tehostamassa. Ajatteles, mitä sillä rahalla olisi saanut! Ajatella, että hän on jo aikuinen!

Current descriptions...
• miettiä 61*C (other members of the THINK group marked in boldface)
  – 1. ajatella, harkita, pohtia, punnita, tuumia, aprikoida, järkeillä, mietiskellä. Mitäpä mietit? Asiaa täytyy vielä miettiä. Mietin juuri, kannattaako ollenkaan lähteä. Vastasi sen enempää miettimättä. Miettiä päänsä puhki.
  – 2. suunnitella; keksiä (miettimällä). Miettiä uusia kepposia. Oli miettinyt hyvän selityksen.
• pohtia 61*F
  – ajatella jtak perusteellisesti, eri mahdollisuuksia arvioiden, harkita, miettiä, tuumia, ajatella, järkeillä, punnita, aprikoida. Pohtia arvoitusta, ongelmaa. Pohtia kysymystä joka puolelta. Pohtia keinoja asian auttamiseksi.
• harkita 69 (harkitsematon, harkitseva, harkittu ks. erikseen)
  – 1. ajatella perusteellisesti, eri mahdollisuuksia arvioiden, pohtia, punnita, puntaroida, miettiä; suunnitella. Harkita ehdotusta, tilannetta. Asiaa kannattaa harkita. Ottaa jtak harkittavaksi, harkittavakseen. Asiaa tarkoin harkittuani päätin - -. Lääkkeitä on käytettävä harkiten. Yhtiö harkitsee toiminnan laajentamista.
  – 2. päätyä jhk perusteellisen ajattelun nojalla, tulla jhk päätelmään, katsoa jksik. Harkitsi parhaaksi vaieta. Sen mukaan kuin kohtuulliseksi harkitaan. Näin olen asian harkinnut.

Monofactorial results
• 5771 (syntactic) contextual features observed altogether in the two corpora
• 340 features with statistically significant differences in their distributions among the four studied verbs (83 morphological features, 208 syntactic argument+semantic/morphological features, 43 syntactic argument+lexemes, and 18 extralinguistic features)
• Some statistically significant individual lexemes as syntactic arguments without a semantic classification inspired an additional classification of semantically similar arguments, e.g.
  – tarkka <- tarkkaan, tarkoin ‘careful/meticulous’ -> vakavasti ‘seriously’, oikeasti ‘really/earnestly’, perusteellisesti ‘thoroughly’, tarkasti ‘thoroughly’, huolellisesti ‘carefully’, syvään <- syvä ‘in depth’ -> SX_MAN.SEM_THOROUGH in MANNER, or
  – vielä ‘still’, enää ‘anymore’, edelleen ‘ever still’, jo ‘already’, yhä ‘evermore’, edelleen <- edelle ‘ever still’, erää <- erä ‘for now’, [ei] koommin ‘[not since]’, vasta ‘just (since a short while)’ -> SX_DUR.SEM_OPEN in DURATION
• Of the statistically significant features
  – 185 were logically associated with some other feature, so they were excluded
  – 39 correlated with another feature to the extent that one of the two was discarded from further analysis
• Resulting in 106 features for further analysis (37 morphological features, 51 syntactic argument+semantic features, 15 syntactic argument+lexemes, and 17 extralinguistic features)

Multifactorial results – selection of a heuristic method for polytomous regression
• logistic regression can be extended from dichotomous to polytomous cases with several heuristics, which are based on a set of dichotomous logistic regression models (those explicitly observed in this study marked in boldface)
  – one vs. rest (Rifkin & Klautau 2004)
     • ajatella vs. miettiä+pohtia+harkita, miettiä vs. ajatella+pohtia+harkita, ...
  – double-round-robin aka pairwise (Fürnkranz 2002)
     • ajatella vs. miettiä, ajatella vs. pohtia, ..., miettiä vs. ajatella, ...
  – nested dichotomies
     • e.g. ajatella vs. (miettiä vs. (pohtia vs. harkita)) or (ajatella vs. miettiä) vs. (pohtia vs. harkita)
  – ensemble of nested dichotomies, i.e. ENDs (Frank & Kramer 2004)
     • an aggregate of (a random selection of) all possible nestings
  – multinomial logistic models
     • a single logistic model
     • presuppose a baseline, e.g. the most frequent or most prototypical lexeme (in this case ajatella) vs. the rest
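To make the first two heuristics concrete, the sketch below only enumerates which dichotomous models each of them would require for the four verbs; it does not fit any regression. The function names are hypothetical.

```python
from itertools import permutations

VERBS = ["ajatella", "miettiä", "pohtia", "harkita"]

def one_vs_rest(lexemes):
    """One dichotomous model per lexeme: that lexeme against all the others."""
    return [(lex, [other for other in lexemes if other != lex]) for lex in lexemes]

def double_round_robin(lexemes):
    """One dichotomous model per ordered pair of distinct lexemes."""
    return list(permutations(lexemes, 2))

for target, rest in one_vs_rest(VERBS):
    print(target, "vs.", "+".join(rest))
for first, second in double_round_robin(VERBS):
    print(first, "vs.", second)
```

For four lexemes this gives four one-vs-rest models and twelve ordered pairwise models, in which each pair appears in both directions (ajatella vs. miettiä as well as miettiä vs. ajatella).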

Comparison of the pros and cons of the various heuristic methods
• one-vs-rest
  +: Nmodels=Nlexemes
  +: Provides direct probabilities -> lex(max(P(context))) can be selected in evaluation
  +: Highlights features separating one lexeme from the rest
  –: May not uncover a distinguishing feature which is approximately equally common with another lexeme
• double-round-robin
  +: Discovers thoroughly all pairwise differences among the lexemes
  –: Nmodels=Nlexemes*(Nlexemes-1)
  –: Does not provide probabilities directly -> has to rely on some heuristic, e.g. a voting scheme, for selection in evaluation
  –: For a lexeme contrasting positively with another lexeme and negatively with a third lexeme, provides a contradictory aggregate result -> may exaggerate differences within the group as a whole
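The two selection rules mentioned above can be sketched as follows: picking the lexeme with the highest probability for one-vs-rest, and a simple majority vote for the pairwise models. The slides do not specify the voting scheme, so the vote below is an assumption, and all probabilities are hypothetical. For brevity the pairwise example lists each pair only once.

```python
from collections import Counter

def select_one_vs_rest(probs):
    """Pick the lexeme whose one-vs-rest model gives the highest probability."""
    return max(probs, key=probs.get)

def select_by_voting(pairwise_probs):
    """Each pairwise model votes for whichever of its two lexemes it favours."""
    votes = Counter()
    for (first, second), p_first in pairwise_probs.items():
        votes[first if p_first >= 0.5 else second] += 1
    return votes.most_common(1)[0][0]

# Hypothetical model outputs for one context (NOT from the study).
ovr = {"ajatella": 0.2, "miettiä": 0.5, "pohtia": 0.2, "harkita": 0.1}
pairwise = {
    ("ajatella", "miettiä"): 0.40,  # P(ajatella) in the ajatella vs. miettiä model
    ("ajatella", "pohtia"): 0.30,
    ("ajatella", "harkita"): 0.60,
    ("miettiä", "pohtia"): 0.45,
    ("miettiä", "harkita"): 0.70,
    ("pohtia", "harkita"): 0.80,
}
print(select_one_vs_rest(ovr), select_by_voting(pairwise))
```

In this made-up context the two heuristics disagree, which illustrates why the aggregation step of the pairwise method needs scrutiny of its own.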

Comparison (cont’d)...
• nested dichotomies
  +: Nmodels=Nlexemes-1
  +: Provides direct probabilities for each lexeme in a context
  –: Selecting one appropriate nesting may be difficult or impossible
• ensemble of nested dichotomies
  +: Can take into account different perspectives represented by several different nestings
  +: Provides direct probabilities for each lexeme as averages over the nestings
  –: Nmodels(Nlex)=(2*Nlex-3)*Nmodels(Nlex-1); Nmodels(1)=1
• multinomial models
  +: Nmodels=1
  –: Does not provide distinguishing features for the baseline lexeme
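The recurrence given for the ensemble of nested dichotomies can be evaluated directly. The sketch below implements exactly that recurrence; the function name is hypothetical.

```python
def n_nestings(n_lexemes):
    """Evaluate Nmodels(N) = (2N - 3) * Nmodels(N - 1), with Nmodels(1) = 1."""
    count = 1
    for n in range(2, n_lexemes + 1):
        count *= 2 * n - 3
    return count

for n in range(1, 7):
    print(n, n_nestings(n))
```

For the four THINK verbs the recurrence gives 15, which agrees with the fifteen nestings listed in the nested-dichotomy evaluation; the count grows quickly with larger synonym groups (105 for five lexemes, 945 for six).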

Evaluation results – one vs. all
• five rounds, each with a random 2/3 sample of the corpus data for training and the remaining 1/3 held out for evaluation
• 2269 training cases -> 1135 test cases
               Recall.Total  Recall.Total.%
Mean           734.4         64.71
Std.Dev        17.9          1.58
tau (Kendall)  0.5461221     0.02402176
          Test.Mean  Test/All.%  Recall.Mean  Recall.%  Recall.Std.Dev  Recall.Std.Dev.%  Precision.Mean  Precision.%
ajatella  493.6      43.49       419.2        84.94     15.6            2.36              564.6           74.25
harkita   131.2      11.56       59.2         45.45     2.4             4.41              107.8           55.15
miettiä   274.2      24.16       126.6        46.24     5.8             2.59              234.4           54.04
pohtia    236.0      20.79       129.4        54.91     6.2             1.70              228.2           56.90
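The percentages in the one-vs-all table can be approximately reproduced from its mean counts. The sketch below assumes that Test.Mean is the number of actual cases of a verb in the test sample, Recall.Mean the number of those predicted correctly, and Precision.Mean the number of all cases predicted as that verb; these readings are inferred from the figures, not stated on the slide. Small deviations from the slide's values are expected if the reported percentages were averaged per round.

```python
def recall(correct, actual):
    """Share of the actual instances of a verb that were predicted correctly."""
    return correct / actual

def precision(correct, predicted):
    """Share of the predictions of a verb that were correct."""
    return correct / predicted

# Mean counts for ajatella, under the assumed column meanings described above.
test_mean = 493.6       # actual ajatella cases in the test sample
recall_mean = 419.2     # of those, correctly predicted
precision_mean = 564.6  # all cases predicted as ajatella

ajatella_recall = 100 * recall(recall_mean, test_mean)
ajatella_precision = 100 * precision(recall_mean, precision_mean)
overall_accuracy = 100 * recall(734.4, 1135)  # Recall.Total mean over all test cases

print(f"{ajatella_recall:.2f} {ajatella_precision:.2f} {overall_accuracy:.2f}")
```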

Evaluation – double-round-robin
               Recall.Total  Recall.Total.%
Mean           735.0         64.76
Std.Dev        20.4          1.79
tau (Kendall)  0.5612979     0.02145662
          Test.Mean  Test/All.%  Recall.Mean  Recall.%  Recall.Std.Dev  Recall.Std.Dev.%  Precision.Mean  Precision.%
ajatella  512.8      45.18       426.0        83.05     17.5            1.50              551.2           77.27
harkita   119.6      10.54       52.0         43.43     5.8             3.55              110.0           47.40
miettiä   259.6      22.87       123.4        47.53     13.1            3.98              244.8           50.41
pohtia    243.0      21.41       133.6        55.02     12.7            5.49              229.0           58.39

Evaluation – nested dichotomies
(a = ajatella, m = miettiä, p = pohtia, h = harkita)
Nesting           Nest.Mean  Nest.St.Dev.  %.Nest/Test  Nest.Rank
(a, (m, (p, h)))  722.4      17.10         63.65        7
(a, (p, (m, h)))  577.4      7…            50.87        15
(a, (h, (m, p)))  721.0      13.21         63.52        10
(m, (a, (p, h)))  724.2      13.14         63.81        4
(m, (p, (a, h)))  729.4      15.81         64.26        1
(m, (h, (a, p)))  721.6      9.24          63.58        8
(p, (a, (m, h)))  719.0      14.49         63.35        13
(p, (m, (a, h)))  728.8      12.32         64.21        2
(p, (h, (a, m)))  726.8      15.51         64.04        3
(h, (a, (m, p)))  720.8      13.77         63.51        12
(h, (m, (a, p)))  721.6      11.28         63.58        8
(h, (p, (a, m)))  721.0      12.86         63.52        10
((a, m), (p, h))  723.2      14.32         63.72        6
((a, p), (m, h))  715.8      13.01         63.07        14
((a, h), (m, p))  723.2      8.76          63.72        6
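A nested dichotomy such as (a, (m, (p, h))) yields a probability for each verb by multiplying the probabilities of the branches taken along the path to it. The sketch below illustrates this with hypothetical split probabilities; the representation of a nesting as nested tuples is an assumption made for the example.

```python
def nested_probabilities(nesting, split_prob):
    """Probability of each lexeme under one nested dichotomy.

    nesting: a lexeme name (leaf) or a (left, right) tuple of sub-nestings.
    split_prob: maps each (left, right) tuple to P(left branch | node reached).
    """
    if isinstance(nesting, str):
        return {nesting: 1.0}
    left, right = nesting
    p_left = split_prob[nesting]
    probs = {}
    for lex, p in nested_probabilities(left, split_prob).items():
        probs[lex] = p_left * p
    for lex, p in nested_probabilities(right, split_prob).items():
        probs[lex] = (1 - p_left) * p
    return probs

# The nesting (a, (m, (p, h))) with hypothetical outputs of its three models.
inner = ("pohtia", "harkita")
middle = ("miettiä", inner)
top = ("ajatella", middle)
splits = {top: 0.5, middle: 0.4, inner: 0.5}

probabilities = nested_probabilities(top, splits)
print(probabilities)
```

Because every split divides the probability mass of its node between two branches, the resulting values always sum to one over the four verbs.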

Evaluation – ensemble of nested dichotomies
               Recall.Total  Recall.Total.%
Mean           733.6         64.64
Std.Dev        9.6           0.85
tau (Kendall)  0.5527388     0.01366292
          Test.Mean  Test/All.%  Recall.Mean  Recall.%  Recall.Std.Dev  Recall.Std.Dev.%  Precision.Mean  Precision.%
ajatella  495.0      43.61       422.0        85.27     7.0             1.64              567.0           74.42
harkita   138.2      12.18       59.8         43.33     2.6             2.06              106.2           56.38
miettiä   263.6      23.22       121.0        45.88     9.8             2.62              237.2           51.07
pohtia    238.2      20.99       130.8        55.01     6.1             3.68              224.6           58.48
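An ensemble of nested dichotomies provides per-lexeme probabilities as averages over the individual nestings. A minimal sketch of that averaging step, with hypothetical distributions standing in for the outputs of two nestings:

```python
def average_probabilities(distributions):
    """Average several per-lexeme probability distributions (one per nesting)."""
    lexemes = distributions[0].keys()
    return {
        lex: sum(dist[lex] for dist in distributions) / len(distributions)
        for lex in lexemes
    }

# Hypothetical outputs of two different nestings for the same context.
nesting_1 = {"ajatella": 0.50, "miettiä": 0.20, "pohtia": 0.15, "harkita": 0.15}
nesting_2 = {"ajatella": 0.40, "miettiä": 0.30, "pohtia": 0.20, "harkita": 0.10}

ensemble = average_probabilities([nesting_1, nesting_2])
predicted = max(ensemble, key=ensemble.get)
print(ensemble, predicted)
```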

Evaluation – comparison of results
• prediction accuracy
  – one-vs-all: 64.71%
  – double-round-robin: 64.76%
  – nested dichotomies: in 14/15 cases 63–64%
  – ENDs: 64.64%
• OBVIOUSLY no significant difference in prediction performance

Results - model coefficients • See Appendix 1 (at the end) for the full results of the aggregated model preferences with the one-vs-all heuristic • See Appendix 2 (at the end) for the full results of the aggregated model preferences with the double-round-robin heuristic

Model coefficients – comparison – AGENT
Method/Verb: one-vs-all | round-robin
• ajatella: -GROUP | -2ND, -3RD, -PASS, ---GROUP
• miettiä: +SING, +GROUP
• pohtia: +PASS, -SING, +GROUP | -1ST, +2ND, +3RD, -SING, +PASS, ++GROUP
• harkita: -GROUP, -INDIVIDUAL | +1ST, +/-GROUP
+: positive significant coefficient; -: negative significant coefficient (in each verb-vs-rest or pairwise verb-vs-verb comparison); similar coefficients marked in boldface

Model coefficients – comparison – PATIENT
Method/Verb: one-vs-all | round-robin
• ajatella: +INDIVIDUAL, +GROUP, -NOTION, -ATTRIBUTE, -COMMUNICATION, -ACTIVITY | ++INDIVIDUAL, ++GROUP, ---NOTION, -ATTRIBUTE, ---COMMUNICATION, ---ACTIVITY
• miettiä: +NOTION, +COMMUNICATION | -INDIVIDUAL, -GROUP, +NOTION, +COMMUNICATION, +/-ACTIVITY
• pohtia: -INDIVIDUAL, +NOTION, +ATTRIBUTE, +COMMUNICATION | -INDIVIDUAL, -GROUP, ++NOTION, +ATTRIBUTE, +COMMUNICATION, +/-ACTIVITY
• harkita: +ACTIVITY | +/-NOTION, +COMMUNICATION, +++ACTIVITY
+: positive significant coefficient; -: negative significant coefficient (in each verb-vs-rest or pairwise verb-vs-verb comparison); similar coefficients marked in boldface

Discussion
• doubling the number of features (51 -> 106) included in training the multi-level models increases prediction accuracy by less than ten percent (58–59% -> 64–65%)
• both one-vs-rest and double-round-robin uncover practically the same feature-lexeme associations
• BUT double-round-robin brings forth essentially more distinctive features than one-vs-all
• Person/Number distinctions are marginalized when considered together with other factors with the one-vs-all heuristic

Further issues
• the remaining, not insignificant proportion (~35%) of incorrect predictions with all the studied methods
  – does this represent truly synonymous, i.e. interchangeable, cases, which could be explored with experimentation as in Arppe & Järvikivi 2002, or
  – is this a result of some still missing features in the models?
  – how much does the aggregation process of the individual dichotomous regression models in the double-round-robin method contribute to this inaccuracy?
• collinearity of the factors should be scrutinized
• double-round-robin method: how to aggregate the coefficient values of the various component dichotomous regression models
• differences between the registers (formal, i.e. newspaper, vs. informal, i.e. newsgroup) should be studied
• morphological family size (‘MFS’) effects (cf. Schreuder & Baayen 1997, De Jong 2002, Moscoso del Prado Martín et al. 2004), i.e. noun/adjective derivations of the studied verbs, should also be observed

Conclusions
• A wide range of different linguistic (morphological, lexical, syntactic and semantic) and extralinguistic (register, repetition) features appear to influence the choice of the studied synonymous verbs
• Univariate, bivariate and multivariate statistical methods each play an essential role in the discovery of both these factors and their relative weights and interactions

Appendix 1 – Full results of aggregated model coefficient preferences with one-vs-all heuristic
Morphological features: ajatella  miettiä  pohtia  harkita
Z_INF1      0  +  0  0
Z_INF2      +  0  -  0
Z_INF3      -  +  0  0
Z_INF4      -  +  0  (-)
Z_PCP1      0  0  -  0
Z_PCP2      -  +  (-)  0
Z_ABE       0  +  (-)  0
Z_ESS       +  0  0  0
Z_INE       -  0  +  0
Z_NOM       0  -  (+)  0
Z_TRA       +  -  0  -
Z_ANL_IND   (+)  -  -
Z_ANL_KOND  0  0  -  0
Z_ANL_IMP   0  +  -  0
Z_ANL_NEG   +  -  -  0
Z_ANL_PASS  0  0  +  0
Z_ANL_SING  0  +  (-)  0
+: positive significant coefficient; -: negative significant coefficient (in each verb-vs-rest or pairwise verb-vs-verb comparison); (+) or (-): coefficients with a p-value .05<p<.1

Appendix 1 – cont’d
Syntactic arguments: ajatella  miettiä  pohtia  harkita
SX_AGE.SEM_INDIVIDUAL      0  0  0  -
SX_AGE.SEM_GROUP           -  0  +  (-)
SX_PAT                     0  0  +  -
SX_PAT.SEM_INDIVIDUAL      +  0  -  0
SX_PAT.SEM_GROUP           +  0  0  0
SX_PAT.SEM_NOTION          -  +  +  0
SX_PAT.SEM_COMMUNICATION   -  +  (+)  0
SX_PAT.SEM_ACTIVITY        -  0  0  +
SX_PAT.SEM_ATTRIBUTE       -  0  +  0
SX_PAT.INFINITIVE          +  0  -  0
SX_PAT.PARTICIPLE          +  0  -  0
SX_PAT.DIRECT_QUOTE        -  +  +  0
SX_PAT.INDIRECT_QUESTION   -  +  +  0
SX_LX_että_CS.SX_PAT       +  0  -  -
SX_LX_se_PRON.SX_PAT       -  +  (+)  0
SX_SOU                     +  0  -  0
SX_GOA                     +  0  -  -
SX_GOA.SEM_NOTION          0  0  (+)  0

Appendix 1 – cont’d
Syntactic arguments: ajatella  miettiä  pohtia  harkita
SX_MAN.SEM_ALONE     (-)  0  0  0
SX_MAN.SEM_CONCUR    +  0  0  0
SX_MAN.SEM_DIFFER    +  0  0  0
SX_MAN.SEM_FRAME     +  -
SX_MAN.SEM_GENERIC   +  -  0  0
SX_MAN.SEM_NEGATIVE  +  0  (-)  0
SX_MAN.SEM_THOROUGH  -  0  0  +
SX_QUA.SEM_LITTLE    0  +  0  0
SX_LOC.SEM_EVENT     -  0  +  (-)
SX_LOC.SEM_GROUP     0  +  0  (-)
SX_LOC.SEM_LOCATION  -  0  +  (-)
SX_TMP               -  +  0  0
SX_TMP.PHR_CLAUSE    0  (+)  -  0
SX_TMP.SEM_TIME      0  0  +  0
SX_DUR.SEM_LONG      -  +  0  0
SX_DUR.SEM_OPEN      -  0  0  0
SX_DUR.SEM_SHORT     -  +  0  -
SX_FRQ.SEM_AGAIN     (-)  0  0  +
SX_FRQ.SEM_OFTEN     (-)  +  (-)  0

Appendix 1 – cont’d
Syntactic arguments: ajatella  miettiä  pohtia  harkita
SX_RSN                      (-)  0  0  0
SX_CND                      -  0  0  +
SX_META                     0  0  (-)  (+)
SX_LX_mukaan_PSP.SX_META    0  0  0  +
SX_COMP                     0  (-)  +  0
SX_AAUX                     0  -  +  0
SX_CAUX                     0  0  (-)  0
SX_CV                       -  +  0  0
SX_LX_alkaa_V.SX_AAUX       -  0  0  0
SX_LX_joutua_V.SX_AAUX      -  0  0  0
SX_LX_kannattaa_V.SX_AAUX   -  0  -  +
SX_LX_tarvita_V.SX_AAUX     -  +  0  0
SX_LX_voida_V.SX_AAUX       0  0  0  +

Appendix 1 – cont’d
Extralinguistic features: ajatella  miettiä  pohtia  harkita
Z_EXTRA_DE_hs95_KA       0  (-)  0  +
Z_EXTRA_DE_hs95_KU       0  -  +  (-)
Z_EXTRA_DE_hs95_MN       0  (-)  +  0
Z_EXTRA_DE_hs95_MP       +  (-)  0  0
Z_EXTRA_DE_hs95_NH       0  (+)  0  0
Z_EXTRA_DE_hs95_PO       -  0  0  (+)
Z_EXTRA_DE_hs95_TA       -  0  (+)  0
Z_EXTRA_DE_hs95_UL       -  0  0  +
Z_EXTRA_DE_hs95_YO       -  (-)  +  +
Z_EXTRA_DE_ihmissuhteet  +  +  -  (-)
Z_EXTRA_DE_politiikka    +  0  (-)  -
Z_QUOTE                  +  +  -  0
Z_REPEAT                 +  0  0  -

Appendix 2 – Full results of aggregated model coefficient preferences with double-round-robin heuristic
Morphological features: ajatella  miettiä  pohtia  harkita
Z_INF1                      -  ++  -  0
Z_INF2                      +  (+)  -(-)(-)  (+)
Z_INF3                      -  +(+)  (-)  0
Z_INF4                      -(-)  ++  (+)  -
Z_PCP1                      0  (+)  -(-)  +
Z_PCP2                      --  +(+)  -(-)  ++
Z_PTV                       (-)-  +  0  (+)
Z_ABE                       (+)  (-)(-)  0
Z_ESS                       ++(+)  -  (-)  -
Z_INE                       ---  +-  +(+)+  +(-)
Z_NOM                       +  -  0  0
Z_TRA ‘come to [think]’     +++  -  -(+)  -(-)
Z_ANL_IND                   +  ++  --  -
Z_ANL_KOND                  (-)+  +  --  (+)
Z_ANL_IMP                   +  (+)+  --  (-)
Z_ANL_NEG                   ++  -  -  0
Z_ANL_FIRST                 0  0  (-)  (+)
Z_ANL_SECOND                -  0  +  0
Z_ANL_THIRD                 -  0  +  0
Z_ANL_SING                  0  +  -  0
Z_ANL_PASS                  -  0  +  0
+: positive significant coefficient; -: negative significant coefficient (in each verb-vs-rest or pairwise verb-vs-verb comparison); (+) or (-): coefficients with a p-value .05<p<.1

Appendix 2 – cont’d
Syntactic arguments: ajatella  miettiä  pohtia  harkita
SX_AGE.SEM_INDIVIDUAL                    0  0
SX_AGE.SEM_GROUP                         --(-)  +  (+)+  +-
SX_PAT                                   (+)-  +-  +++  (-)--
SX_PAT.SEM_INDIVIDUAL                    +(+)  -  (-)  0
SX_PAT.SEM_GROUP                         +(+)  -  (-)  0
SX_PAT.SEM_NOTION                        ---  +  ++  +-
SX_PAT.SEM_ATTRIBUTE                     -  0  +  0
SX_PAT.SEM_COMMUNICATION                 (-)--  +  +  (+)
SX_PAT.SEM_ACTIVITY                      ---  +-  +-  +++
SX_PAT.INFINITIVE                        +  0  --  +
SX_PAT.PARTICIPLE                        +  0  --  +
SX_PAT.INDIRECT_QUESTION                 ---  +++  +(+)-  +-(-)
SX_PAT.DIRECT_QUOTE                      --  +(-)  +(+)  0
SX_LX_että_CS.SX_PAT ‘that’              +++  -  -  -
SX_LX_se_PRON.SX_PAT ‘consider that!’    --  +(+)  +  (-)
SX_SOU                                   (+)(+)+  (-)  -  (-)
SX_GOA                                   +++  -  -  -

Appendix 2 – cont’d
Syntactic arguments: ajatella  miettiä  pohtia  harkita
SX_MAN.SEM_CONCUR    (+)  (-)  0  0
SX_MAN.SEM_DIFFER    (+)  0  (-)  0
SX_MAN.SEM_FRAME     ++  -(-)  +(+)  --
SX_MAN.SEM_GENERIC   +  -  0  0
SX_MAN.SEM_NEGATIVE  (+)+  0  -  (-)
SX_MAN.SEM_THOROUGH  --  +  0  +
SX_QUA.SEM_LITTLE    -  +(+)  (-)  0
SX_QUA.SEM_MUCH      (+)  0  0  (-)
SX_LOC.SEM_EVENT     -  (+)  ++  (-)-
SX_LOC.SEM_GROUP     -  ++(+)  (-)  -
SX_LOC.SEM_LOCATION  --  0  ++  +-
SX_TMP               --  +  0  +
SX_TMP.PHR_CLAUSE    +  +  ---  +
SX_TMP.SEM_TIME      -  -  +(+)+  (-)
SX_DUR.SEM_LONG      --  +(+)+  -  +(-)
SX_DUR.SEM_OPEN      ---  +  +  +
SX_DUR.SEM_SHORT     -(-)  ++  (+)  -
SX_FRQ.SEM_AGAIN     -(-)  (+)  0  +
SX_FRQ.SEM_OFTEN     -  ++  -  0

Appendix 2 – cont’d
Syntactic arguments: ajatella  miettiä  pohtia  harkita
SX_RSN                                               (-)  0  (+)  0
SX_CND                                               -  (-)  -  +(+)+
SX_META                                              0  0  -  +
SX_LX_mukaan_PSP.SX_META ‘according to [someone]’    -  -  (-)  ++(+)
SX_COMP                                              (+)  -  ++  (-)-
SX_AAUX                                              +  --  +  0
SX_CAUX                                              0  (+)  (-)  0
SX_CV                                                --  ++  +-  0
SX_LX_alkaa_V.SX_AAUX ‘start [thinking]’             +-(-)  +  (+)(+)
SX_LX_joutua_V.SX_AAUX ‘have to [think]’             -(-)  +  (+)  0
SX_LX_kannattaa_V.SX_AAUX ‘be worth [thinking]’      --  +-+  --  +++
SX_LX_tarvita_V.SX_AAUX ‘need to [think]’            -  ++  -  0

Appendix 2 – cont’d
Extralinguistic features: ajatella  miettiä  pohtia  harkita
Z_EXTRA_DE_hs95_KA       -  -  (-)  ++(+)
Z_EXTRA_DE_hs95_KN       (-)  (+)  0  0
Z_EXTRA_DE_hs95_KU       0  -  ++  -
Z_EXTRA_DE_hs95_MA       -  0  0  +
Z_EXTRA_DE_hs95_MN       0  -  +  0
Z_EXTRA_DE_hs95_MP       ++  -  -  0
Z_EXTRA_DE_hs95_PO       --(-)  +  (+)  +
Z_EXTRA_DE_hs95_TA       --  0  +  +
Z_EXTRA_DE_hs95_UL       -  0  0  +
Z_EXTRA_DE_hs95_YO       --  -(-)  +(+)  ++
Z_EXTRA_DE_ihmissuhteet  +  +  --  0
Z_EXTRA_DE_politiikka    ++  +  -  --
Z_QUOTE                  +  +  ---  +
Z_REPEAT                 +++  -  -  -