Linguistic annotation 2/14/2006 Nianwen Xue

Outline
• Tokenization / segmentation, POS tagging
• Treebanking
  ◦ Constituent structure and structural ambiguity
  ◦ Basic grammatical relations and how argument structure is instantiated
• Propbanking / nombanking
  ◦ Cross-linguistic syntactic alternations, verb senses and argument structure
• Others: named entity, coreference, discourse connectives

Tokenization
• English
  ◦ In the new position he will oversee Mazda ’s U.S. sales , services , parts and marketing operations .
  ◦ We did n’t have much of a choice .
  ◦ U.S. trade officials said the Philippines and Thailand would be the main beneficiaries of the president ’s action .
  ◦ Anything ’s possible -- how about the new Guinea Fund ?

Tokenization
• The federal government suspended sales of the U.S. savings bonds because Congress has n’t lifted the ceiling on government debt .
• The Treasury said the U.S. will default on Nov. 9 if Congress does n’t act by then .

Tokenization
• Assets of the 400 taxable funds grew by $ 1.5 billion during the latest week .
• Exports in October stood at $ 5.29 billion , a mere 0.7 % increase from a year earlier , while imports increased sharply to $ 5.39 billion , up 20 % from last year .
• Do you notice any ambiguity in tokenization?
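
The splitting conventions shown in these examples (clitics such as n’t and ’s, the currency sign, the percent sign, sentence punctuation) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `tokenize`, the two-entry `ABBREVIATIONS` set and the clitic pattern are hypothetical and are not the Penn Treebank's own tokenizer.

```python
import re

# Assumption: a tiny hand-written abbreviation list, just enough for the slides' examples.
ABBREVIATIONS = {"U.S.", "Nov."}

# A stem followed by a clitic: n't, 's, 're, 've, 'd, 'll, 'm (straight or curly apostrophe).
CLITIC = re.compile(r"^(.+?)(n['’]t|['’](?:s|re|ve|d|ll|m))$", re.IGNORECASE)


def tokenize(text):
    """Split whitespace-delimited text into Treebank-style tokens (rough sketch)."""
    tokens = []
    for chunk in text.split():
        trailing = []
        # Peel punctuation off the end, unless the chunk is a known abbreviation.
        while len(chunk) > 1 and chunk[-1] in ".,?!;:%" and chunk not in ABBREVIATIONS:
            trailing.insert(0, chunk[-1])
            chunk = chunk[:-1]
        # Separate a leading currency sign from the amount.
        if chunk.startswith("$") and len(chunk) > 1:
            tokens.append("$")
            chunk = chunk[1:]
        # Split a clitic from its stem: didn't -> did + n't, Mazda's -> Mazda + 's.
        match = CLITIC.match(chunk)
        if match:
            tokens.extend([match.group(1), match.group(2)])
        else:
            tokens.append(chunk)
        tokens.extend(trailing)
    return tokens


print(tokenize("We didn't have much of a choice."))
print(tokenize("Mazda's U.S. sales grew by $1.5 billion, up 20%."))
```

The abbreviation list is where the ambiguity hides: when a sentence ends in "U.S.", the final period is both part of the abbreviation and the sentence terminator, and a rule like the one above cannot tell which.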

Exercise
• How many sentences in the WSJ corpus of the Penn Treebank contain “’re”?
• How many sentences in the WSJ corpus of the Penn Treebank contain “’d”?

Big deal, you say
• The problem is pushed to the forefront for languages like Chinese, where there are no delimiting spaces between words
  这句话里有几个词?
  Howmanywordsarethereinthissentence?
• Segmented and glossed:
  zhe ju hua li you ji ge ci
  这 句 话 里 有 几 个 词 ?
  this CL sentence inside have [how many] CL word ?
  “How many words are there in this sentence?”

A much harder problem than it first appears…
• Well, what if we just create a list of words (a dictionary) and compare the sentence against this list?
• 日文章鱼怎么说?
  Dictionary entries: 日 “Sun”, 日文 “Japanese”, 文章 “article”, 章鱼 “octopus”, 鱼 “fish”, 怎么 “how”, 说 “say”
• 日文 章鱼 怎么 说 ?
  Japanese octopus how say
  “How do you say octopus in Japanese?”
• 日 文章 鱼 怎么 说 ?
  Sun article fish how say
  ???
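
To see why dictionary lookup alone does not settle the matter, the sketch below enumerates every way of splitting the example sentence into dictionary words. The function name `segmentations` is made up for illustration; the dictionary is the seven-entry list from the slide.

```python
def segmentations(sentence, dictionary):
    """Return every way to split `sentence` into words found in `dictionary`."""
    if not sentence:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(sentence) + 1):
        word = sentence[:end]
        if word in dictionary:
            # Keep this word and segment whatever is left.
            for rest in segmentations(sentence[end:], dictionary):
                results.append([word] + rest)
    return results


DICTIONARY = {"日", "日文", "文章", "章鱼", "鱼", "怎么", "说"}

for words in segmentations("日文章鱼怎么说", DICTIONARY):
    print(" ".join(words))
# 日 文章 鱼 怎么 说
# 日文 章鱼 怎么 说
```

Both outputs consist entirely of dictionary words, so the dictionary cannot choose between them. A greedy longest-match strategy happens to pick the sensible reading here, but that is a heuristic and does not resolve the ambiguity in general.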

Computer problem vs human problem
• Well, that may be a problem for the computer because the computer is dumb…
• Segmentation is difficult for humans as well
  ◦ What is a word?
  ◦ Different criteria do not coincide

What if we let native speakers follow their intuitions?
• Inadequate level of inter-annotator agreement
  ◦ Sproat, 1996: 70%
  ◦ Xue et al., 2005: 90%
• Conclusion: need a linguistic definition of wordhood to develop segmentation standards
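
The slides do not say how these agreement figures were computed. One common way to compare two segmentations of the same text, shown here purely as an assumption, is to treat each word as a character span and take the F-score of the spans the two annotators share. The helper names `spans` and `agreement` are hypothetical.

```python
def spans(words):
    """Convert a word list into a set of (start, end) character offsets."""
    result, start = set(), 0
    for word in words:
        result.add((start, start + len(word)))
        start += len(word)
    return result


def agreement(seg_a, seg_b):
    """F-score over the word spans shared by two non-empty segmentations."""
    a, b = spans(seg_a), spans(seg_b)
    shared = len(a & b)
    if shared == 0:
        return 0.0
    precision = shared / len(b)
    recall = shared / len(a)
    return 2 * precision * recall / (precision + recall)


annotator_1 = ["日文", "章鱼", "怎么", "说"]
annotator_2 = ["日", "文章", "鱼", "怎么", "说"]
print(round(agreement(annotator_1, annotator_2), 3))  # 0.444
```

Only 怎么 and 说 occupy the same positions in both segmentations, so two of four spans match on one side and two of five on the other.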

Packard’s (2000) notion of words
• Orthographic word: Words are defined by delimiters in written text. This appears to have no relevance in Chinese since there are no such written delimiters.
• Sociological word: Following Chao (1968, pp. 136-138), these are ‘that type of unit, intermediate in size between a phoneme and a sentence, which the general, non-linguistic public is conscious of, talks about, has an everyday term for, and is practically concerned with in various ways.’ In English this is the lay notion of ‘word’, whereas in Chinese this is the character (字 zi).

Packard’s notions of word
• Lexical word: This corresponds to Di Sciullo and Williams’s (1987) listeme.
• Semantic word: Roughly speaking, this corresponds to a “unitary concept”.
• Phonological word: Defined according to phonological criteria. Is it a domain in which a phonological process applies? Is it a prosodic unit?

Packard’s notions of word
• Morphological word: Following Di Sciullo and Williams (1987), a morphological word is anything that is the output of a word-formation rule.
• Syntactic word: These are all and only the constructions that occupy X⁰ in the syntax. Well, first you need to know what X⁰ is.
• Psycholinguistic word: This is the “‘word’ level of linguistic analysis that is … salient and highly relevant to the operation of the language processor”.

Wordhood tests
• Phonological:
  ◦ Bound morpheme: a bound morpheme forms a word with its neighboring morpheme
• Syntactic:
  ◦ Insertion: if another morpheme can be inserted between X and Y, then X-Y is unlikely to be a word
  ◦ XP-substitution: if a morpheme cannot be replaced with an XP of the same type, then it is likely to be part of a word

Wordhood tests
• Semantic:
  ◦ If the meaning of X-Y is non-compositional, then it is a word
• Others:
  ◦ Productivity: if a rule that combines morpheme X and morpheme Y is not productive, then X-Y is likely to be a word
  ◦ Frequency of co-occurrence: if morphemes X and Y co-occur frequently, then they form a word

Exercise
• Given the wordhood criteria and wordhood tests we have discussed, how many words are there in “can’t”?

Answer
• Orthographic word: 1
• Sociological word: ?
• Lexical word: 2
• Semantic word: 2
• Phonological word: 1
• Morphological word: 2
• Syntactic word: 2
• Psycholinguistic word: ?

Chinese morphological types
• Reduplication
• Affixation
• Compounding
• Proper names
• Abbreviations

Verbal reduplication
  说说 shuo-shuo (speak-speak) “speak a little”
  看看 kan-kan (look-look) “take a look”
  走走 zou-zou (walk-walk) “take a walk”
  磨磨 mo-mo (rub-rub) “rub a little”
  讨论讨论 taolun-taolun (discuss-discuss) “discuss a little”
  请教请教 qingjiao-qingjiao (ask-ask) “ask a little”

Verbal reduplication
  说一说 shuo-yi-shuo (speak one speak) “speak a little”
  看一看 kan-yi-kan (look one look) “take a look”
  走一走 zou-yi-zou (walk one walk) “take a walk”
  磨一磨 mo-yi-mo (rub one rub) “rub a little”
  *讨论一讨论 taolun-yi-taolun (discuss-one-discuss)
  *请教一请教 qingjiao-yi-qingjiao (ask-one-ask)

Adjectival reduplication
  舒服 shufu: 舒舒服服 shushu-fufu “comfortable”; 舒服舒服 shufu-shufu “enjoy”
  干净 ganjing: 干干净净 gangan-jingjing “very clean”; 干净干净 ganjing-ganjing “clean up”
  糊涂 hutu: 糊糊涂涂 huhu-tutu “muddleheaded”; (?) 糊涂糊涂 hutu-hutu
  快活 kuaihuo: 快快活活 kuaikuai-huohuo “happy”; 快活快活 kuaihuo-kuaihuo “make happy”
  漂亮 piaoliang: 漂漂亮亮 piaopiao-liangliang “pretty”

Prefixation
  老 lao-: 老王 lao-wang “old Wang”
  小 xiao-: 小王 xiao-wang “small Wang”
  第 di-: 第一 di-yi “first”
  初 chu-: 初三 chu-san “the third”
  可 ke-: 可爱 ke-ai “cute”

Suffixation
  学 -xue: 心理学 xinli-xue “psychology”
  家 -jia: 心理学家 xinli-xue-jia “psychologist”
  化 -hua: 绿化 lv-hua “greenize??”
  率 -lv: 录取率 luqu-lv “enrollment rate”
  主义 -zhuyi: 马克思主义 makesi-zhuyi “Marxism”

Compounding
• Location:
  客厅 沙发 keting-shafa “living room sofa”
  河 马 hema “river horse (hippopotamus)”
  海 狮 haishi “sea lion (seal)”
• Used for:
  指甲 油 zhijia you “nail polish”
  乒乓 球 pingpang qiu “ping-pong ball”
  太阳 眼镜 taiyang yanjing “sunglasses”
• Material:
  大理石 地板 dalishi diban “marble floor”
  纸 老虎 zhilaohu “paper tiger”

Resultative verb compounding
• Result:
  打破 dapo “break by hitting”
  拉开 lakai “open by pulling”
• Achievement:
  写清楚 xieqingchu “write clearly”
  买到 maidao “succeed in buying”
• Direction:
  跳过去 tiaoguoqu “jump across”
  走进来 zoujinlai “come walking in”

Subject-Verb compounds
  头疼 tou-teng (head hurt) “have a headache”
  嘴硬 zui-ying (mouth hard) “stubborn”
  眼红 yan-hong (eye red) “covet”
  心酸 xin-suan (heart sour) “feel sad”
  命苦 ming-ku (fate bitter) “unlucky”

Subject-Verb compounds
  我 的 头 很 疼
  I DE head very hurt
  “My head hurts badly.”
  这 事 让 我 很 头疼
  this matter make I very headache
  “This gave me a real headache.”

Verb-object compounds
  出版 chu-ban (emit edition) “publish”
  睡觉 shui-jiao (sleep) “sleep”
  毕业 bi-ye (finish study) “graduate”
  开刀 kai-dao (operate knife) “operate”
  开玩笑 kai-wanxiao (make joke) “make a joke”
  照相 zhao-xiang (shine image) “take a picture”

Verb-object compounds
  别 开玩笑 !
  “Do not joke!”
  开 他 的 玩笑 。
  make he DE joke
  “Make fun of him.”

Let’s try one
  她 很 担心 孩子 的 健康 成长
  Test type      Test                Test result
  phonological   Bound morpheme      Yes?
                 Syllable count      yes
  syntactic      Insertion           no
                 XP substitution     no
  semantic       Non-compositional   yes
  others         Productive          N/A
                 Frequency           N/A
  Prediction: One word

But …
• 担心: 她 为 孩子 担 心
  Test type      Test                      Test result   Prediction
  phonological   Bound morphemes?          no            Two words
                 Syllable count            yes           One word
  syntactic      Insertion                 yes           Two words
                 XP substitution           yes           Two words
  semantic       Non-compositional?        yes           One word
  others         Productive?               N/A
                 Frequent co-occurrence?   N/A

Summary
• Wordhood has to be decided in context
• When wordhood tests lead to conflicting predictions, decisions will have to be made based on what the annotated corpus is for.

Discussion question
• Based on the wordhood criteria we have discussed, is “make headway” one word or two words?

POS-tagging: throwing words into different buckets…
• Each category is a bucket
• How many buckets are there?
  ◦ Noun
  ◦ Verb
  ◦ Adjective
  ◦ Preposition
  ◦ Adverb
• Which bucket should “five”, “the” and “$” go into?

Penn Treebank Tagsets (buckets)
• CC - coordinating conjunction: and, but
• CD - cardinal number: one, two, three
• DT - determiner: a, the, this, that
• EX - existential there
• FW - foreign word
• IN - preposition or subordinating conjunction
• LS - list marker: firstly, secondly
• TO - to
• UH - interjection: uh, oh

CC or DT
• Neither/?? he nor/CC she likes skiing.
• Neither/?? man likes skiing.
• Either/?? Jean or/CC Mary likes singing.
• Either/?? girl likes singing.
• Both/?? Jack and/CC Tom hate singing.
• Both/?? men hate singing.

CC or DT
• Neither/CC he nor/CC she likes skiing.
• Neither/DT man likes skiing.
• Either/CC Jean or/CC Mary likes singing.
• Either/DT girl likes singing.
• Both/CC Jack and/CC Tom hate singing.
• Both/DT men hate singing.

CD or NN
• One/?? of the best reasons
• The only one/?? of its kind
• The only ones/?? of its kind

CD or NN
• One/CD of the best reasons
• The only one/NN of its kind
• The only ones/NN of its kind

EX or RB
• There/?? was a party in progress.
• There/?? ensued a melee.
• There/?? , a party was in progress.
• There/?? , ensued a melee.

EX or RB
• There/EX was a party in progress.
• There/EX ensued a melee.
• There/RB , a party was in progress.
• There/RB , ensued a melee.

The role of context in POS tagging
• Can we take a list of all the words in a language and decide which bucket each word should go into, without looking at the context in which the word occurs?
• Water, can, drops
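
A small sketch makes the point concrete: a context-free word list can only return every tag a word might take, while even a crude look at the previous tag narrows the choice. The lexicon entries and the two context rules below are illustrative assumptions, not a real tagger.

```python
# Toy lexicon: the tags each word can take when looked up without context.
# The entries are assumptions made for this illustration.
LEXICON = {
    "he": {"PRP"},
    "decided": {"VBD"},
    "to": {"TO"},
    "the": {"DT"},
    "water": {"NN", "VB"},
    "can": {"MD", "NN", "VB"},
    "drops": {"NNS", "VBZ"},
    "plant": {"NN", "VB"},
}


def tag(tokens):
    """Tag tokens left to right, using the previous tag to resolve ambiguity."""
    tagged, prev = [], None
    for token in tokens:
        options = LEXICON[token.lower()]
        if len(options) == 1:
            choice = next(iter(options))
        elif prev in {"TO", "MD"} and "VB" in options:
            choice = "VB"  # base-form verb after "to" or a modal
        elif prev == "DT" and "NN" in options:
            choice = "NN"  # noun after a determiner
        else:
            choice = "|".join(sorted(options))  # still ambiguous
        tagged.append((token, choice))
        prev = choice
    return tagged


print(sorted(LEXICON["water"]))  # ['NN', 'VB']
print(tag("He decided to water the plant".split()))
```

Looked up in isolation, "water" stays ambiguous between NN and VB. In the sentence, "to" makes it a verb and "the" makes "plant" a noun.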

Categorizing context
• Morphological
• Syntactic
• Semantic

Morphological context
• Inflectional morphology
  Verb: destroy, destroying, destroyed
  Noun: destruction, destructions
  He watered the plant.
• Derivational morphology
  Noun: destruction

Syntactic context
• Verb:
  The bomb destroyed the building.
  He decided to water the plant.
• Noun:
  The destruction of the building

Semantic context
• Verb: action, activity
• Noun: state, object, etc.

What do we have in Chinese?
• Morphological clues: not as many
• Syntactic clues: not as rich, but they exist
• Semantic clues: about the same

Syntactic clues
• Impoverished, but they exist:
  这 座 大楼 的 倒塌
  this CL building DE collapse
  “the collapse of this building”
  这 座 大楼 看起来 要 倒塌
  this CL building seem will collapse
  “It looks like this building will collapse.”

Semantic clues
• Same as English:
  ◦ Noun: state, object, etc.
  ◦ Verb: action, activity, etc.

When syntactic and semantic clues are in conflict
  这 座 大楼 的 倒塌
  this CL building DE collapse
  “the collapse of this building”
• Option 1: 倒塌 is a verb regardless of its context
• Option 2: 倒塌 can be a noun or a verb depending on its context
• The Chinese Treebank decision: option 2
  ◦ POS tags based on syntactic clues encode not only a word’s own lexical properties, but also information provided by its context
  ◦ “Context-free” POS tags are no better than a dictionary
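
Option 2 can be illustrated with a deliberately tiny rule: tag 倒塌 as a noun (NN) when it directly follows 的, and as a verb (VV) otherwise. The function `tag_daota` and its single rule are assumptions made for illustration only; they are not the Chinese Treebank's annotation procedure.

```python
def tag_daota(tokens):
    """Return one tag per occurrence of 倒塌, decided by the preceding token."""
    tags = []
    for i, token in enumerate(tokens):
        if token == "倒塌":
            follows_de = i > 0 and tokens[i - 1] == "的"
            tags.append("NN" if follows_de else "VV")
    return tags


print(tag_daota(["这", "座", "大楼", "的", "倒塌"]))            # ['NN']
print(tag_daota(["这", "座", "大楼", "看起来", "要", "倒塌"]))  # ['VV']
```

The same word form receives two different tags, so the tag records something about the syntactic context that a dictionary entry could not.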

Online references
• Chinese Treebank: www.cis.upenn.edu/~chinese
• Sproat, Richard. 2002. Coling tutorial: www.linguistics.uiuc.edu/rws
• Penn Treebank: www.cis.upenn.edu/~treebank/home.html