Text-based typology
Corpora, corpora of elicited texts and parallel corpora (based on STUF 2007)
МД
Pros as compared to questionnaires
- Contextualization of examples
- Naturalistic discourse
- Intralinguistic variation
- Potentially makes up for gaps in grammars
Frog stories
- (Mercer Mayer) – e.g. Slobin on Talmy’s verb-framed vs. satellite-framed languages
Fish movie
- Russell Tomlin – attention and word order
Pear stories
- W. Chafe et al.
- A six-minute film shot at UC Berkeley in 1975
- Widely used in cross-linguistic research
  - referential density project
Referential density (Bickel 2003)
- Relative frequency of overt NPs (via Nichols 2014)
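A minimal sketch of how such a measure could be computed, assuming that "relative frequency of overt NPs" means the ratio of overtly realized argument NPs to the argument positions available in each clause. The `Clause` record and the toy counts are hypothetical and are not taken from Bickel's data.

```python
from dataclasses import dataclass


@dataclass
class Clause:
    """Hypothetical annotation of one clause in a narrative."""
    argument_positions: int  # argument slots licensed by the predicate
    overt_nps: int           # slots filled by an overt noun or pronoun


def referential_density(clauses):
    """Share of argument positions that are realized by an overt NP."""
    total_positions = sum(c.argument_positions for c in clauses)
    if total_positions == 0:
        raise ValueError("no argument positions annotated")
    return sum(c.overt_nps for c in clauses) / total_positions


# Toy narrative of four clauses (invented numbers, for illustration only).
toy_story = [Clause(2, 1), Clause(1, 1), Clause(3, 1), Clause(2, 2)]
rd = referential_density(toy_story)
print(f"referential density: {rd:.3f}")
```

Under this reading a higher value means that more arguments are spelled out overtly, which is what makes the measure comparable across retellings of the same film.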
Cons of elicited corpora
- Not directly comparable
  - events are focused on or omitted
  - mostly quantitative results
- Require massive linguistic effort
  - limited data for each language
- Any alternative?
  - Parallel corpora
Some examples of parallel corpora
- Ruscorpora.ru
Massively parallel texts
- Harry Potter
  - Including subtitles (76, 21)
- Biblical translations
  - Pater Noster in 1300 lgs, 400 full texts, 1,000 gospels
- Marxist texts
  - State and Revolution: 71 translations in 36 lgs
- Legal databases:
  - Proceedings of the European Parliament
  - Universal Declaration of Human Rights (329)
- Unesco online database of literary translations (1.5 mln items)
- Andersen, Le Petit Prince, …
Cysouw and Wälchli 2007
Comparability (easy counts)
- Parallel corpora:
  - roughly comparable number of sentences (from 1,663 to 1,528 for Petit Prince)
- Elicited texts:
  - pear stories in the same language vary from 29 to 119 sentences (Bickel 2003 via Nichols 2014)
- ‘Free’ corpora:
  - not applicable…
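The contrast in these counts can be made explicit with a quick calculation. The sketch below only uses the sentence counts quoted on this slide; the max/min ratio is an illustrative choice of spread measure, not one proposed in the sources.

```python
# Sentence-count ranges quoted on the slide: (smallest, largest).
sentence_counts = {
    "parallel (Petit Prince translations)": (1528, 1663),
    "elicited (pear stories, same language)": (29, 119),
}


def spread_ratio(smallest, largest):
    """How many times longer the longest text is than the shortest one."""
    return largest / smallest


ratios = {name: spread_ratio(lo, hi) for name, (lo, hi) in sentence_counts.items()}
for name, ratio in ratios.items():
    print(f"{name}: longest text is {ratio:.2f} times the shortest")
```

On these figures the parallel texts differ in length by under 10 percent, while the elicited retellings differ by a factor of about four.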
Comparability (methodology)
- Comparison by intension
  - definition of a phenomenon
  - browsing grammars
- Comparison by extension
  - linguistic structures used for expressing a contextualized situation
  - truly functional
Wälchli 2007
Extensional typology in parallel corpora
- the data we work with may be linguistically different but semantically identical
  - cf. much looser identity in elicited texts
- rather, they are “defined as a selection of places in the parallel texts”
- they may reflect linguistic variation
  - at points where one language uses the same construction, another language uses several
Parallel corpora support conventional typology
- Newmeyer against Stassen
  - Classical Greek, Latin and Tibetan have the ‘exceed’ type comparative, contra Stassen 1985
- Wälchli supports Stassen
  - A study of parallel corpora shows not the ‘exceed’ but the ‘separative’ construction
- Parallel corpora reflect dominant patterns – exactly where typology’s primary interests lie
  - But they also numerically reflect variation or competition between dominant patterns, rather than providing a yes-or-no typology
Case studies, among others:
- Wälchli 2005: co-compounds
- Auwera et al. 2004: epistemic possibility in Slavic
- Wälchli 2006: ‘again’
- Wälchli 2001: motion events
- Wälchli & Zúñiga 2006: motion events ‘again’
- Stolz 2004: total reduplication
- Stolz et al. 2005: comitatives and instrumentals
- Stolz et al.: absolute possessives
Stolz 2003, 2004
Le Petit Prince - quantitative
- ‘avec’-(de)cline
- Total-reduplication-(de)cline
- Does this require parallel corpora?
Stolz 2003, 2004
Le Petit Prince – qualitative?
- Puis il s’épongea le front avec un mouchoir à carreaux rouges.
- Then he mopped his forehead with a handkerchief decorated with red squares.
- Zatim obrise čelo rupčičem s crvenim kvadratima.
Pitfalls: data analysis
- Easier than raw texts
  - we know what was intended and where to look
  - still, like any grammatical analysis by a non-expert, subject to mistakes
- Alignment issues
- Anyway, same or easier than with elicited texts
Wälchli 2007
Pitfalls: sample bias
Europe overrepresented, convenience sampling:
- Europe > IE > other families
- In his study of comitatives, Stolz ended up with an areal rather than a sample-based study
Pitfalls: style/variant choice
- Standard language bias
  - Better include texts reporting speech
- ‘Hagiolect’ effects
  - ‘The sinners will-Evid not enter the heaven’
- Style incomparability
  - Bible translations are stylistically diverse
- Purism
Wälchli 2007
Pitfalls: translation bias
- “Incommensurability” of linguistic structures: some languages think differently…
  - Australian lgs prefer absolute over relative frame of reference
  - In Australian Gospels, occurrences of the absolute frame of reference (AFR) are found but are significantly less frequent than in natural discourse from this area
- “Inert” construction – a construction that tends to be imported from the source language
Wälchli 2007
Case study: MVC in ‘bring’ and ‘run’ events
Bible-based, Bernhard Wälchli
- Multi-verb construction (MVC): a clause that contains more than one lexical verb
- BRING and RUN events may be expressed by an MVC or by a “solitarizing” verb
BRING and RUN events (Wälchli)
Examples:
  Minnin  ti-bouay    la   ban   mouin.                              (Haitian Creole)
  lead    little-boy  def  give  I
  Ač-i-ne            man    pat-ăm-a        il-se      kil-ĕr.       (Chuvash)
  child-ps3-dat/acc  I.gen  to-poss1sg-dat  take-conv  come-imp2pl
  ‘… bring him unto me.’ (solitarizing)
Data usually unavailable from grammars…
BRING and RUN events (Wälchli)
- Is there any correlation between the choice of either construction for encoding the two events?
BRING and RUN events (Wälchli)
              BRING: Solit               BRING: MVC
  RUN: Solit  Dinka, Navajo, Russian     Ainu, Ewe, Khasi
  RUN: MVC    English, Guarani, Maltese  Choctaw, Chuvash, Khoekhoe
BRING and RUN events (Wälchli)
- 165 languages (Eurasia over-represented)
- 18 BRING events, six RUN events
- Correlation between MVC in BRING and RUN is highly significant (Fisher’s test)
              BRING: Solit  BRING: MVC
  RUN: Solit  65            12
  RUN: MVC    46            42
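The significance claim can be checked directly from the four cell counts. The sketch below implements a two-sided Fisher exact test with the standard library only. The slide says just "Fisher's test", so the two-sided variant and the reading of the table (rows = RUN, columns = BRING) are assumptions, and the function name is illustrative.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables that are no more probable than the observed one
    # (small relative tolerance guards against floating-point ties).
    return sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )


# Cell counts from the slide: rows RUN (Solit, MVC), columns BRING (Solit, MVC).
table = [[65, 12], [46, 42]]
(a, b), (c, d) = table
odds_ratio = (a * d) / (b * c)
p_value = fisher_exact_two_sided(a, b, c, d)
print(f"n = {a + b + c + d}, odds ratio = {odds_ratio:.2f}, p = {p_value:.2e}")
```

The four cells add up to the 165 languages of the sample, and the odds ratio of roughly 5 indicates that a language which is solitarizing for RUN is much more likely to be solitarizing for BRING as well.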
BRING and RUN events (Wälchli)
- Is a language consistently MVC vs. solitarizing?
  - Surely not – then, is this a typological parameter at all?
- But: the distribution is bimodal
- If we only consider LOW and HIGH, fewer (14) languages are inconsistent
Case study: demonstratives
Deictic demonstratives (then adverbs), Potter-based, Federica da Milano 2007
- Distance-oriented systems
  - this near – that far
- Person-oriented systems
  - this with us – that far from us
- Is this a real distinction, or are these two subtypes of something more general?
Demonstratives (da Milano)
- 48 stimuli (da Milano 2005)
  - Also include reciprocal orientation of the locutors: face to face, face to back, side by side
- 83 occurrences of deictic demonstratives in “… and the Chamber of Secrets”
  - this with us – that far from us
Demonstratives (da Milano)
One-term systems:
- French – cela, ça (ceci not used)
- German – der/die/das (dieser, jener not used)
Two-term systems:
- Unmarked vs. proximal – Scandinavian, English, Northern Italian
- Unmarked vs. distal – Polish, Russian, Czech, Hungarian, Modern Greek
- Dyad-oriented – Catalan
Demonstratives (da Milano)
Which demonstrative is used in ‘neutral’ contexts?
- ‘Tie that round the bars,’ said Fred, throwing the end of a rope to Harry.
- ‘Przywiąż to do kraty’, powiedział Fred, rzucając Harry’emu koniec liny.
Demonstratives (da Milano)
da Milano then proceeds to build a similar typology for adverbs; her conclusions are as follows:
- The map of adverbs is by and large isomorphic to the map of pronouns
- Levinson 2004 confirmed: “perhaps one can hazard the generalizations that speaker-centered degrees of distance are usually (more) fully represented in the adverbs than the pronominals”
- “It has turned out to be fruitful to use parallel texts as a control test of data obtained through the questionnaire. The results from the parallel texts mainly confirmed the prior typological generalizations.”
‘Free’ corpora!
- No translations – no risk of inert categories, closer to naturalistic discourse
- Massive amounts of text
  - Usually literary
- Vast playground for quantitative analysis
‘Free’ corpora!
Examples:
- Combinatorial statistics for property words
  - Lexical typology by Lex.Typ
- Comparative occurrences
  - May be useful – cf. temperature domain
Temperature scale
[Figure: temperature terms t‘ež, šog, tak’, ĵerm, gol, paġ, saŕǝ, zov, hov, c’urt arranged on a scale; recoverable labels: tactile, non-tactile]
Comparison: texts in typology
- Free corpora:
  - No ‘meaning’ identity, hence a shift towards intensional typology
  - Massive collections: almost all kinds of phenomena
  - Natural discourse
- Elicited texts:
  - Weak ‘meaning’ identity
  - Massive effort for transcription, poor collections
  - Only frequent phenomena
  - Natural discourse (with provisos)
- Parallel corpora:
  - Strong ‘meaning’ identity
  - Natural written discourse (with provisos)
Summary (obvious):
- Corpora have their limitations and cannot substitute for conventional methods – but can go hand in hand with them