From Linguistic Insights to Theoretical Relevance
語料觀察與理論詮釋 (Corpus Observation and Theoretical Interpretation)
Chu-Ren Huang, Academia Sinica
http://cwn.ling.sinica.edu.tw/huang.htm

Or: How to avoid getting deleted?
"Old linguists don't die, they just get obligatorily deleted."
  • Title of Eva Hajičová's talk, given when she accepted the Association for Computational Linguistics Lifetime Achievement Award in July 2006
  • From an old Indiana University Linguistics Club T-shirt
2007.06.02 NCKU, Tainan NCL 2007 Chu-Ren Huang

How to avoid getting deleted I
  • I have data, lots of data! And some linguistic generalizations too.

How Much Data? Comparing Chinese Corpora Online

Corpus Name          Year   Data (million words/characters)                 Duration / Content
Sinica 4.0 (Taiwan)  1996   5.2 M words; 7.9 M characters                   1990-1996; fully tagged
Sinica 5.0 (Taiwan)  2006   10 M words                                      1990-2004; fully tagged
Sinorama (Taiwan)    2003   3.2 M English words; 5.3 M Chinese characters   1976-2000 (1999-2000); aligned
CCL (Peking)         2003   85 M simplified characters                      1919-2003; partially tagged (1 million)

How to find generalizations from Corpus I
  • Size matters
    • Observable facts are directly proportional to corpus size
    • An average collocation requires corpus size of 1,000,000 (10,000 x 10)
  • Hence the Gigaword Corpus

How Much Data Can One Have? Chinese Gigaword Corpus

Source   First Edition   New in Second Edition
CNA      1991-2002       Oct. 2002 to Dec. 2004
Xinhua   1990-2002       Jan. 2003 to Dec. 2004
Zaobao                   Oct. 2000 to Sep. 2003

Size of Chinese Gigaword Corpus

Edition          Resource   Characters (million)   Words (million)   Documents (thousand)
First Edition    CNA        735                    462               1,649
First Edition    Xinhua     382                    252               817
First Edition    TOTAL      1,118                  714               2,466
Second Edition   CNA        792                    497               1,769
Second Edition   Xinhua     471                    310               992
Second Edition   Zaobao     28                     18                41
Second Edition   TOTAL      1,291                  825               2,803

Size of Lexicon Extracted from Chinese Gigaword Corpus

Source   Word Types   Word Tokens
CNA      1,917,093    496,465,879
XIN      1,409,747    305,595,420
ZBN      273,111      18,328,571
Total    2,999,590    820,389,870

How to find generalizations from Corpus II
  • Numbers (and tendencies) don't lie
    • A large enough sample allows us to observe the tendencies (distributional patterns) of language use
  • Linguistic knowledge can be used to generate more linguistic knowledge
    • In order to extract grammatical relations automatically and efficiently, we applied detailed lexical and grammatical knowledge, especially argument structures and collocating patterns of verbs

Finding a Word's Company: Corpus KeyWord In Context (KWIC) and the coloured pens approach
The coloured pens method (from Kilgarriff et al. 2005):
  1. political association
  2. social event
  3. group of people
  4. person in an agreement/dispute
  5. to be party to something
  ...
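A KWIC display can be sketched in a few lines. This is only an illustration: the toy sentences are invented, the function name `kwic` is hypothetical, and the headword "party" is an assumption based on the senses listed on the slide.

```python
def kwic(tokens, keyword, window=3):
    """Return (left context, keyword, right context) for every hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            lines.append((" ".join(left), tok, " ".join(right)))
    return lines


# Toy corpus, invented for illustration (not BNC data).
text = ("the party won the election . "
        "we met at a birthday party last night . "
        "a party of tourists arrived")
tokens = text.split()

hits = kwic(tokens, "party")
for left, kw, right in hits:
    # Right-align the left context so the keyword lines up in one column.
    print(f"{left:>20}  [{kw}]  {right}")
```

With the keyword aligned in a column, a lexicographer can mark each line with a sense (the "coloured pens"); a word sketch automates the grouping step.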

A Word's Company Automatically Detected: WordSketch with BNC Data

Sketch Engine and Chinese WordSketch Engine
  • Sketch Engine: http://www.sketchengine.co.uk
    • Developed by a team led by Adam Kilgarriff
    • A new corpus viewing tool
    • Discovering grammatical information from a gigantic corpus
  • Chinese Wordsketch by Academia Sinica: http://www.ling.sinica.edu.tw/wordsketch (free license for use in Taiwan only)
    • Academia Sinica, Taiwan (Huang, Smith, Ma, Simon 黃居仁,史尚明,馬偉雲,石穆)

WordSketch's Approach: From Lexical Types to Relation Types
  • BNC has 100,000,000 words
    • 939,028 word types
    • 70,000,000 tuples (relations) extracted
    • More than 70 relations per lemma
  • For CWS II and the CGW corpus (CNA data)
    • 1,917,093 word types
    • 59,183,238 tuples (<eat, obj, rice>)
    • More than 30 relations per lemma
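The "more than 30 relations per lemma" figure for the CNA data can be checked directly from the two counts on the slide. The sketch below assumes that word types stand in for lemmas, which the slide does not state explicitly.

```python
# Counts taken from the slide (CNA portion of the Chinese Gigaword Corpus).
cna_word_types = 1_917_093
cna_tuples = 59_183_238

# Assumption: word types are used as a proxy for lemmas.
relations_per_type = cna_tuples / cna_word_types
print(f"{relations_per_type:.2f} relations per word type")
```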

Linguistic Knowledge in Use
Example tuple: <eat, obj, rice>
  • Challenge: long-distance relations
    • 全穀麵包,吃了很健康。 quan.gu mian.bao, chi le hen jian.kang ("Whole-grain bread is healthy to eat.")
    • 有人嘗試要將這荷花分類,卻越分越累。 you ren chang.shi yao jiang zhe he.hua fen.lei, que yue fen yue lei ("Some people tried to classify these lotuses, but the more they classified, the more tired they became.")
    • 他 只 吃了 一 口 飯 … Ta zhi chi le yi kou fan … ("He ate only one mouthful of rice …")

Applying Linguistic Knowledge in Processing
Knowledge Source
  • Information-based Case Grammar (ICG, Chen and Huang 1992)
  • Encoded on over 40,000 verbs in the Sinica Lexicon
  • ICG basic patterns for Stative Pseudo-transitive Verbs (VI):
      EXPERIENCER<GOAL[PP[對]]<VI
      EXPERIENCER<VI<<GOAL[PP[於]]
      THEME<GOAL[PP{對、以}]<VI
      THEME<VI<<GOAL[PP[於]]
      THEME<VI<<SOURCE[PP{自、於}]
      THEME<SOURCE[PP{歸、為}]<VI

Linguistic Knowledge in Action
  • 村莊(object) 明天將 被 夷為平地(VB11) cunzhuang mingtian jiang bei yiweipingdi ("The village will be razed to the ground tomorrow.")
    • begin time 1 location time 1 adv? passive_prep adv_string 1:"V[BCJ].*" [tag!="DE"]
  • 大量 的 遊客 破壞(VC2) 公園 景觀(object) daliang de youke pohuai gongyuan jingguan ("Large numbers of tourists damage the park scenery.")
    • 1:"VC.*" (particle|prep)? NP not_noun
    • (NP is defined as "…noun_modifier{0,2} 2:noun…")
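The second pattern above can be approximated over a POS-tagged sentence. This is a simplified sketch, not the Sketch Engine grammar formalism: the function name, the tag sets `NOUN_TAGS` and `PARTICLE_OR_PREP`, and the tags on the example tokens are all assumptions made for illustration.

```python
# Hypothetical tag sets, loosely modelled on Sinica-style tags (assumption).
NOUN_TAGS = {"Na", "Nb", "Nc", "Nd"}
PARTICLE_OR_PREP = {"Di", "P"}


def extract_objects(tagged):
    """Approximate: VC* verb, optional particle/prep, up to two noun
    modifiers plus a head noun, followed by a non-noun or sentence end."""
    triples = []
    n = len(tagged)
    for i, (word, tag) in enumerate(tagged):
        if not tag.startswith("VC"):
            continue
        j = i + 1
        if j < n and tagged[j][1] in PARTICLE_OR_PREP:
            j += 1  # skip one optional particle or preposition
        k = j
        while k < n and tagged[k][1] in NOUN_TAGS:
            k += 1  # consume the run of nouns; it ends at a non-noun
        if 1 <= k - j <= 3:  # head noun plus at most two modifiers
            triples.append((word, "object", tagged[k - 1][0]))
    return triples


sent1 = [("大量", "Neqa"), ("的", "DE"), ("遊客", "Na"),
         ("破壞", "VC2"), ("公園", "Nc"), ("景觀", "Na")]
sent2 = [("他", "Nh"), ("只", "Da"), ("吃", "VC"), ("了", "Di"),
         ("一", "Neu"), ("口", "Nf"), ("飯", "Na")]

print(extract_objects(sent1))
print(extract_objects(sent2))
```

The first sentence yields the tuple for 破壞 and 景觀. The second yields nothing, because the numeral and measure word separate the verb from its object; handling such cases is where the more detailed grammatical knowledge comes in.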

WordSketch Example: 發票 'invoice'

Mining Cross-Strait Lexical Difference
  • Strategy: use a pair of known contrasting words as seeds and look up SketchDifference
    • Clinton: 克林頓 ke4 vs. 柯林頓 ke1
  • What is found: other unique translations and/or patterns for either PRC or Taiwan
  • 克林頓 (PRC) only patterns (vs. 柯林頓 only):
      葉利欽  88  54.6  Yeltsin   葉爾勤 (3)
      布什  65  49.7  Bush   布希 (4)
      萊溫斯基  10  41.3  Lewinsky   呂茵斯基/呂女 (1)
      戈爾  20  39.4  Gore   高爾 (2)
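The seed-word strategy amounts to comparing the collocate lists of the two variants and keeping what appears with only one of them. The sketch below is an illustration, not the SketchDifference implementation. The counts for 克林頓 are the first number in each row on the slide, read here as frequencies (an assumption); the counts for 柯林頓 and the shared collocate 總統 are invented placeholders.

```python
def sketch_difference(collocates_a, collocates_b):
    """Split collocates into those found only with A, only with B, or both."""
    only_a = {w: f for w, f in collocates_a.items() if w not in collocates_b}
    only_b = {w: f for w, f in collocates_b.items() if w not in collocates_a}
    shared = {w: (collocates_a[w], collocates_b[w])
              for w in collocates_a if w in collocates_b}
    return only_a, only_b, shared


# Collocates of 克林頓 (PRC form); 總統 and its count are invented.
prc = {"葉利欽": 88, "布什": 65, "戈爾": 20, "萊溫斯基": 10, "總統": 50}
# Collocates of 柯林頓 (Taiwan form); all counts here are placeholders.
tw = {"葉爾勤": 1, "布希": 1, "高爾": 1, "呂茵斯基": 1, "總統": 1}

only_prc, only_tw, shared = sketch_difference(prc, tw)
ranked_prc = sorted(only_prc, key=only_prc.get, reverse=True)
print(ranked_prc)
print(sorted(only_tw))
```

Collocates that occur with only one seed are candidates for further PRC-only or Taiwan-only translations.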

The question is:
  • What are we (linguists) going to do with so much data?
  • How do we analyze and map it promptly?

How to avoid getting deleted II
  • I have theory, fancy theory! And some linguistic generalizations too.

Grammatical Theories: A Selective Survey
  • Early Influences
  • Structuralism
  • (Mainstream) Chomskyan Generative Grammars
  • Other Generative Grammars
  • 'Functional' Grammars

Grammatical Theories: Early Influences
  • René Descartes's rationalism (what Chomsky calls 'Cartesian Linguistics')
    • Generalizations about the world can only be proved by rational reasoning (the famous Cogito ergo sum, 'I think, therefore I am.')
  • The Young Grammarians of Leipzig in the 19th century (the Neogrammarians, German Junggrammatiker)
    • Regularity of sound changes
    • No exceptions to grammatical rules

Grammatical Theories: Structuralism
  • De Saussure: the Sign/Signified dichotomy
  • American Structuralism: Jespersen, Bloomfield
    • Parallel development of Relativism
  • European Structuralism
    • The Prague Circle
    • Dependency, Theme/Rheme

Grammatical Theories: (Mainstream) Chomskyan Generative Grammars
  • Early Transformational Grammar
    • Derivational rules: D- and S-Structure
    • Island Conditions
  • Generative Semantics
    • Kill = cause to die [with structural description]
  • Government and Binding: Parameter setting
  • The Minimalist Program
    • Principles and Parameters

Grammatical Theories: Other Generative Grammars
  • Relational Grammar: David Perlmutter
    • 1, 2, 3 and Chômeur
  • Case Grammar: Charles Fillmore
    • Agent, Patient, etc.
    • FrameNet and Construction Grammar
  • GPSG (Generalized Phrase Structure Grammar): Gazdar, Klein, Sag, Pullum

Grammatical Theories: Other Generative Grammars II
  • Head-driven Phrase Structure Grammar (HPSG): Pollard and Sag
    • Derived from GPSG
  • Lexical-Functional Grammar (LFG): Bresnan, Kaplan
  • Tree-Adjoining Grammar: Joshi
  • Categorial Grammar: Montague Semantics
    • Categorial Unification Grammar

Grammatical Theories: Functional Grammars
  • ? Construction Grammar: Goldberg, Fillmore, Kay
  • ? Role and Reference Grammar: Van Valin
  • Cognitive Grammar: Langacker

Back to the Basics: What can a grammatical theory explain?
  • Aristotle's four causes of reason and Pustejovsky's generative lexicon
  • The qualia structure of linguistic theory
    • Formal
    • Constitutive
    • Agentive
    • Telic

Linguistics: Systematic knowledge of language
  • What is in the system?
    • Formal: from Sign to Structure; Structuralism
    • Constitutive: from IA to IP; rule- and transformation-based theories (PS Grammars, TGs, Case Grammar, GPSG, LFG, etc.)
    • Agentive: how grammar comes into being; UG approaches (GB/PP), evolution of language
    • Telic: function- and use-based theories

Why is My Theory Deleted?
  • None of the existing theories attempts to account for all aspects of language use
  • Most are designed to explain only one or two aspects
  • Linguists often use arguments from one aspect to 'delete' other theories which are NOT designed to account for that aspect

How can we save theories from deletion?
  • There must be an underlying and unifying premise motivating all four aspects
    • Which will allow synergy and integration of theories with different perspectives
  • I propose that the premise is Language as a Knowledge System
    • Which is shared by all aspects

Towards Language as Knowledge System
  • A complete theory to account for language as an integral sum of the four aspects
  • Atoms of knowledge: lexicalized concepts
  • 'Frames' of knowledge: lexical semantic relations
  • Instantiation of knowledge: corpus

Examples of our approach to a language's knowledge system
  • Two verbs of ingestion: chi1 吃 vs. he1 喝 (Huang and Hong 2006, Hong et al. 2007)
  • The Shakespearean Garden approach to Tang poems (Huang et al. 2004)
  • Other ongoing work
    • Conventionalized ontology of the Chinese writing system
    • From basic lexicon to linguistic ontology

What can direct human experience tell us about conceptualization?
  • Direct human experience is more concrete and involves less abstraction
    • Hence it may allow more idiosyncrasies
  • The introduction of a larger corpus and more powerful tools also allows us to obtain a finer-grained picture of the collocational patterns
    • Beyond the liquid/solid food contrast

Data Collection and Processing
  • Extract all instances of chi1 and he1 from the tagged Chinese Gigaword Corpus (53,654 chi1; 19,561 he1)
  • Use Chinese Word Sketch to extract and rank the saliency of all collocating grammatical patterns
  • Use SketchDiff to compare the grammatical patterns of these two verbs
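The ranking step can be illustrated with a simple association score. The slides do not say which saliency statistic Chinese Word Sketch uses, so the sketch below assumes pointwise mutual information weighted by log frequency as a stand-in. All counts are invented toy values, and the function name `salience` is hypothetical.

```python
import math


def salience(pair_freq, verb_freq, noun_freq, total):
    """PMI of (verb, object) scaled by log frequency (assumed measure)."""
    pmi = math.log2(pair_freq * total / (verb_freq * noun_freq))
    return pmi * math.log(pair_freq + 1)


# Toy counts, invented for illustration only.
total_tuples = 1000
chi1_freq = 100
objects = {
    # object: (frequency with chi1, overall frequency as an object)
    "fan4": (30, 40),        # rice
    "dong1xi1": (20, 200),   # thing
    "yao4": (5, 10),         # medicine
}

scores = {noun: salience(pair, chi1_freq, freq, total_tuples)
          for noun, (pair, freq) in objects.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

A frequent but unspecific object such as "thing" scores low because it combines with every verb at about the same rate, while a rarer object that strongly prefers the verb is ranked higher.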

Initial Observations: 吃 vs. 喝
  • Chi1 takes solid objects
    • e.g. fruit, food, dinner
  • He1 takes liquid objects
    • e.g. tea, coffee, wine
  • This should offer a classical case of selectional restriction
  • Corpus data: many significant counter-examples
    • chi1 nai3shui3 (to eat milk)
    • he1 xi1fan4 (to drink porridge)

Chi1 'eat' only patterns

He1 'drink' only patterns

Significant Contrast Among Shared Grammatical Patterns

CWS analysis
Table 1: The common patterns for chi1 'eat' and he1 'drink'

Why Classical Selectional Restriction Theory Fails
  • Selectional restriction: a verb selects objects with the desired features
  • The features: [+/- food], [+/- solid], [+/- liquid]
  • The collocational patterns extracted from Chinese Word Sketch suggest otherwise
  • Neutralized selection: metaphorical and metonymic extensions point to the inadequacy of a feature-checking account
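A strict feature-checking account can be written down directly, which makes its failure on the corpus counter-examples easy to see. The sketch is an illustration under stated assumptions: the feature values in `LEXICON`, the requirement that chi1 wants [+solid, -liquid] and he1 wants [-solid, +liquid], and all names in the code are hypothetical encodings rather than the authors' formalization.

```python
# Hypothetical feature lexicon (assumed encoding).
LEXICON = {
    "shui3guo3": {"food": True, "solid": True, "liquid": False},   # fruit
    "cha2": {"food": True, "solid": False, "liquid": True},        # tea
    "nai3shui3": {"food": True, "solid": False, "liquid": True},   # milk
    "xi1fan4": {"food": True, "solid": True, "liquid": True},      # porridge
}

# Assumed selectional restrictions of the two verbs.
VERB_REQUIRES = {
    "chi1": {"food": True, "solid": True, "liquid": False},
    "he1": {"food": True, "solid": False, "liquid": True},
}


def licensed(verb, noun):
    """True if the noun carries every feature value the verb requires."""
    return all(LEXICON[noun].get(feature) == value
               for feature, value in VERB_REQUIRES[verb].items())


# Verb-object pairs named in the slides as occurring in the corpus.
attested = [("chi1", "shui3guo3"), ("he1", "cha2"),
            ("chi1", "nai3shui3"), ("he1", "xi1fan4")]
rejected = [pair for pair in attested if not licensed(*pair)]
print(rejected)
```

Under these assumptions the checker rejects both attested counter-examples, "to eat milk" and "to drink porridge", which is the inadequacy the slide points to.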

Xi1-Fan4 (Porridge)
  • Porridge: a kind of food
  • The patient: both [+solid] and [+liquid] substances
  • The role type is ambivalent
  • It contains two important ingredients: rice (solid) in soup (liquid)

Xi3-Jiu3 (Wedding Banquet)
  • Wedding banquet: a metonymical extension
  • Metaphorical: eventive and coerced
    • From entity-type patient to event-type patient
    • [+solid] and [+liquid] attributes inherited
    • Includes sub-events of eating solid food and drinking wine

Nai3-Shui3 (Milk)
  • Interesting fact: the neutralization of chi1/he1 with milk depends on the subject. It is possible only when the subject is an infant or small child
    • Contrary to the common belief that subjects play no role in selectional restriction
    • The solid/liquid food contrast is a direct human experience after a child is weaned
  • MARVS theory
    • chi1/he1 nai3shui3: subject-internal attributes
    • The Agent Role: [INGEST OBJECT [+/- Solid]]

Shakespearean-garden Approach to Specific Ontology: Tang Poems
  • The text of the Three Hundred Tang Poems (唐詩三百首) exceeds 150,000 characters
  • All of the text was segmented into words and classified
  • Word lists were compiled for three domains (176 word types in total)
    • Animals (64 word types)
    • Plants (59 word types)
    • Artifacts (53 word types)
  • bow.sinica.edu.tw/ont/ts300_ont.html

Distribution of animal concepts in the Three Hundred Tang Poems (唐詩三百首)

Distribution of plant concepts in the Three Hundred Tang Poems (唐詩三百首)

Conclusion I: Linguistic Insights
Corpus as Evidence
  • Corpora greatly extend linguists' range of observations
    • Even across the barrier of time
  • Language as a living organism allows variations and adaptations (the evolutionary view)
  • The coherence of language is the shared tendency of all users
  • Distributional data in corpora lead to the discovery of these shared tendencies
    • This should be more valuable than incidental examples

Conclusion II: Theoretical Relevance
Language as a Knowledge System
  • The generative lexicalist approach to grammar: language as a knowledge system
  • All aspects of language are projected from a unified knowledge system
  • Linguists need to make our theoretical accounts relevant to the world of knowledge and use, not vice versa
    • Failing to do so will get us/our theories deleted