Linguistic annotation 2/14/2006 Nianwen Xue
Outline
• Tokenization / segmentation, POS tagging
• Treebanking
  ◦ Constituent structure and structural ambiguity
  ◦ Basic grammatical relations and how argument structure is instantiated
• Propbanking / nombanking
  ◦ Cross-linguistic syntactic alternations, verb senses and argument structure
• Others: named entity, coreference, discourse connectives
Tokenization
• English
  ◦ In the new position he will oversee Mazda ’s U.S. sales , services , parts and marketing operations.
  ◦ We did n’t have much of a choice.
  ◦ U.S. trade officials said the Philippines and Thailand would be the main beneficiaries of the president ’s action.
  ◦ Anything ’s possible -- how about the new Guinea Fund ?
Tokenization
• The federal government suspended sales of the U.S. savings bonds because Congress has n’t lifted the ceiling on government debt.
• The Treasury said the U.S. will default on Nov. 9 if Congress does n’t act by then.
Tokenization
• Assets of the 400 taxable funds grew by $ 1.5 billion during the latest week.
• Exports in October stood at $ 5.29 billion , a mere 0.7 % increase from a year earlier , while imports increased sharply to $ 5.39 billion , up 20 % from last year.
• Do you notice any ambiguity in tokenization?
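The conventions visible in the examples above (splitting off clitics such as n’t and ’s, separating $ and %, keeping abbreviations like U.S. intact) can be approximated with a few regular expressions. This is only a rough sketch, not the official Penn Treebank tokenizer; the function name `ptb_like_tokenize` and the tiny `ABBREVIATIONS` list are assumptions made for illustration.

```python
import re

# Hypothetical, deliberately tiny abbreviation list; a real tokenizer needs far more.
ABBREVIATIONS = {"U.S.", "Nov."}

def ptb_like_tokenize(text):
    """Approximate Penn Treebank-style tokenization of one sentence."""
    # Split off the negative clitic: "didn't" -> "did n't"
    text = re.sub(r"\b(\w+)(n['’]t)\b", r"\1 \2", text, flags=re.IGNORECASE)
    # Split off other clitics: "Mazda's" -> "Mazda 's", "we're" -> "we 're"
    text = re.sub(r"(\w)(['’](?:s|re|d|ll|ve|m))\b", r"\1 \2", text,
                  flags=re.IGNORECASE)
    # Currency and percent signs become separate tokens
    text = re.sub(r"([$%])", r" \1 ", text)
    # Commas (except inside numbers such as 1,000) and other punctuation
    text = re.sub(r"(,(?!\d)|[;?!])", r" \1 ", text)
    tokens = text.split()
    # Split a sentence-final period unless the last token is a known abbreviation
    if tokens:
        last = tokens[-1]
        if len(last) > 1 and last.endswith(".") and last not in ABBREVIATIONS:
            tokens[-1:] = [last[:-1], "."]
    return tokens

print(ptb_like_tokenize("We didn't have much of a choice."))
print(ptb_like_tokenize("Assets grew by $1.5 billion, up 20%."))
```

The abbreviation check is where the ambiguity shows up: a period after "U.S." or "Nov." may belong to the abbreviation, end the sentence, or both, and a fixed list cannot tell which.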
Exercise
• How many sentences in the WSJ corpus of the Penn Treebank contain “’re”?
• How many sentences in the WSJ corpus of the Penn Treebank contain “’d”?
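One way to approach the exercise, assuming the corpus is already available as a list of tokenized sentences: count the sentences whose token list contains the clitic. The helper name `count_sentences_with` and the toy sentences are hypothetical; no real corpus counts are given here.

```python
def count_sentences_with(token, sentences):
    """Count sentences (lists of tokens) that contain `token` at least once."""
    return sum(1 for sentence in sentences if token in sentence)

# Toy stand-in for a tokenized corpus (not actual WSJ data).
toy_corpus = [
    ["We", "'re", "not", "sure", "."],
    ["He", "'d", "rather", "wait", "."],
    ["They", "said", "they", "'re", "ready", "and", "they", "'re", "waiting", "."],
]

print(count_sentences_with("'re", toy_corpus))
print(count_sentences_with("'d", toy_corpus))
```

The count only makes sense after tokenization: if "we're" were left as one token, neither query would match anything. Note also that the apostrophe character (straight or curly) must match the one used in the corpus.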
Big deal, you say
• The problem is pushed to the forefront for languages like Chinese, where there are no delimiting spaces between words
这句话里有几个词?
Howmanywordsarethereinthissentence?
Big deal, you say
• The problem is pushed to the forefront for languages like Chinese, where there are no delimiting spaces between words
zhe ju hua li you ji ge ci
这 句 话 里 有 几 个 词 ?
this CL sentence inside have [how many] CL word ?
“How many words are there in this sentence?”
A much harder problem than it first appears…
• Well, what if we just create a list of words (a dictionary) and compare the sentence against this list?
• 日文章鱼怎么说?
Dictionary entries: 日 “Sun”, 日文 “Japanese”, 文章 “article”, 章鱼 “octopus”, 鱼 “fish”, 怎么 “how”, 说 “say”
A much harder problem than it first appears…
• Well, what if we just create a list of words (a dictionary) and compare the sentence against this list?
• 日文 章鱼 怎么 说 ?
  Japanese octopus how say
  “How do you say octopus in Japanese?”
• 日 文章 鱼 怎么 说 ?
  Sun article fish how say
  ???
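The dictionary-lookup idea can be tried directly on the example. The sketch below enumerates every way to cover the string with dictionary entries and also runs greedy forward maximum matching, a common baseline that is named here as an illustration and is not something the slides prescribe. The function names are hypothetical.

```python
def all_segmentations(text, dictionary):
    """Return every segmentation of `text` into words found in `dictionary`."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in dictionary:
            for rest in all_segmentations(text[end:], dictionary):
                results.append([word] + rest)
    return results

def forward_max_match(text, dictionary):
    """Greedy left-to-right segmentation that always takes the longest match."""
    max_len = max(len(w) for w in dictionary)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character if nothing in the dictionary matches
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens

DICTIONARY = {"日", "日文", "文章", "章鱼", "鱼", "怎么", "说"}
SENTENCE = "日文章鱼怎么说"

for seg in all_segmentations(SENTENCE, DICTIONARY):
    print(" ".join(seg))
print(" ".join(forward_max_match(SENTENCE, DICTIONARY)))
```

The dictionary licenses both readings shown on the slide, so lookup alone cannot choose between them. Greedy matching happens to pick the sensible one here, but only because of where the longest entries fall.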
Computer problem vs human problem
• Well, that may be a problem for the computer because the computer is dumb…
• Segmentation is difficult for humans as well
  ◦ What is a word?
  ◦ Different criteria do not coincide
What if we let native speakers follow their intuitions?
• Inadequate level of inter-annotator agreement
  ◦ Sproat, 1996: 70%
  ◦ Xue et al., 2005: 90%
• Conclusion: we need a linguistic definition of wordhood to develop segmentation standards
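The slide does not say how agreement was measured. As one possible measure, assumed here purely for illustration, two segmentations of the same string can be compared by the character spans of their words, with agreement computed as 2 × shared words / (words in A + words in B). The function names are hypothetical.

```python
def word_spans(words):
    """Convert a segmentation into a set of (start, end) character spans."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans

def segmentation_agreement(seg_a, seg_b):
    """Share of words on which two segmentations of the same text agree."""
    spans_a, spans_b = word_spans(seg_a), word_spans(seg_b)
    return 2 * len(spans_a & spans_b) / (len(spans_a) + len(spans_b))

annotator_a = ["日文", "章鱼", "怎么", "说"]
annotator_b = ["日", "文章", "鱼", "怎么", "说"]
print(round(segmentation_agreement(annotator_a, annotator_b), 3))
```

Under this measure the two annotators agree only on 怎么 and 说, so a single disputed boundary early in the sentence lowers the score considerably.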
Packard’s (2000) notions of word
• Orthographic word: words are defined by delimiters in written text. This appears to have no relevance in Chinese, since there are no such written delimiters.
• Sociological word: following Chao (1968, pp. 136–138), these are ‘that type of unit, intermediate in size between a phoneme and a sentence, which the general, non-linguistic public is conscious of, talks about, has an everyday term for, and is practically concerned with in various ways.’ In English this is the lay notion of ‘word’, whereas in Chinese this is the character (字 zi).
Packard’s notions of word
• Lexical word: this corresponds to Di Sciullo and Williams’s (1987) listeme.
• Semantic word: roughly speaking, this corresponds to a “unitary concept”.
• Phonological word: defined according to phonological criteria. Is it a domain in which a phonological process applies? Is it a prosodic unit?
Packard’s notions of word
• Morphological word: following Di Sciullo and Williams (1987), a morphological word is anything that is the output of a word-formation rule.
• Syntactic word: these are all and only the constructions that occupy X⁰ in the syntax. Well, first you need to know what X⁰ is.
• Psycholinguistic word: this is the “ ‘word’ level of linguistic analysis that is … salient and highly relevant to the operation of the language processor”.
Wordhood tests
• Phonological:
  ◦ Bound morpheme: a bound morpheme forms a word with its neighboring morpheme
• Syntactic:
  ◦ Insertion: if another morpheme can be inserted between X and Y, then X-Y is unlikely to be a word
  ◦ XP-substitution: if a morpheme cannot be replaced with an XP of the same type, then it is likely to be part of a word
Wordhood tests
• Semantic:
  ◦ If the meaning of X-Y is non-compositional, then it is a word
• Others:
  ◦ Productivity: if a rule that combines morpheme X and morpheme Y is not productive, then X-Y is likely to be a word
  ◦ Frequency of co-occurrence: if morphemes X and Y co-occur frequently, then they form a word
Exercise
• Given the wordhood criteria and wordhood tests we have discussed, how many words are there in “can’t”?
Answer
• Orthographic word: 1
• Sociological word: ?
• Lexical word: 2
• Semantic word: 2
• Phonological word: 1
• Morphological word: 2
• Syntactic word: 2
• Psycholinguistic word: ?
Chinese morphological types
• Reduplication
• Affixation
• Compounding
• Proper names
• Abbreviations
Verbal reduplication
说说 shuo-shuo (speak-speak) “speak a little”
看看 kan-kan (look-look) “take a look”
走走 zou-zou (walk-walk) “take a walk”
磨磨 mo-mo (rub-rub) “rub a little”
讨论讨论 taolun-taolun (discuss-discuss) “discuss a little”
请教请教 qingjiao-qingjiao (ask-ask) “ask a little”
Verbal reduplication with 一 (yi, “one”)
说一说 shuo-yi-shuo (speak one speak) “speak a little”
看一看 kan-yi-kan (look one look) “take a look”
走一走 zou-yi-zou (walk one walk) “take a walk”
磨一磨 mo-yi-mo (rub one rub) “rub a little”
*讨论一讨论 taolun-yi-taolun (discuss-one-discuss)
*请教一请教 qingjiao-yi-qingjiao (ask-one-ask)
Adjectival reduplication
Base | AABB | ABAB
舒服 shufu | 舒舒服服 shushu-fufu “comfortable” | 舒服舒服 shufu-shufu “enjoy”
干净 ganjing | 干干净净 gangan-jingjing “very clean” | 干净干净 ganjing-ganjing “clean up”
糊涂 hutu | 糊糊涂涂 huhu-tutu “muddleheaded” | (?) 糊涂糊涂 hutu-hutu
快活 kuaihuo | 快快活活 kuaikuai-huohuo “happy” | 快活快活 kuaihuo-kuaihuo “make happy”
漂亮 piaoliang | 漂漂亮亮 piaopiao-liangliang “pretty” |
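The reduplication shapes illustrated above (AA, A一A, AABB, ABAB) are regular enough at the character level to detect mechanically. The sketch below only checks the surface pattern; it says nothing about whether a form is grammatical or what it means. The function name is hypothetical.

```python
def reduplication_pattern(word):
    """Return the surface reduplication pattern of `word`, or None."""
    n = len(word)
    if n == 2 and word[0] == word[1]:
        return "AA"
    if n == 3 and word[0] == word[2] and word[1] == "一":
        return "A一A"
    if n == 4 and word[0] == word[1] and word[2] == word[3] and word[0] != word[2]:
        return "AABB"
    if n == 4 and word[:2] == word[2:] and word[0] != word[1]:
        return "ABAB"
    return None

for form in ["看看", "看一看", "舒舒服服", "舒服舒服", "心理学"]:
    print(form, reduplication_pattern(form))
```

A segmenter could use such a check to keep reduplicated forms together as one word, which is a design decision a segmentation standard has to make explicitly.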
Prefixation
老 lao- : 老王 lao-wang “old Wang”
小 xiao- : 小王 xiao-wang “small Wang”
第 di- : 第一 di yi “first”
初 chu- : 初三 chu san “the third”
可 ke- : 可爱 ke-ai “cute”
Suffixation
学 -xue : 心理学 xinli-xue “psychology”
家 -jia : 心理学家 xinli-xue-jia “psychologist”
化 -hua : 绿化 lv-hua “greenize??”
率 -lv : 录取率 luqu-lv “enrollment rate”
主义 -zhuyi : 马克思主义 makesi-zhuyi “Marxism”
Compounding
• Location:
  客厅 沙发 keting-shafa “living room sofa”
  河马 hema “river horse (hippopotamus)”
  海狮 haishi “sea lion (seal)”
• Used for:
  指甲 油 zhijia you “nail polish”
  乒乓 球 pingpang qiu “ping-pong ball”
  太阳眼镜 taiyang yanjing “sunglasses”
• Material:
  大理石 地板 dalishi diban “marble floor”
  纸老虎 zhilaohu “paper tiger”
Resultative verb compounding
• Result:
  打破 dapo “break by hitting”
  拉开 lakai “open by pulling”
• Achievement:
  写清楚 xieqingchu “write clearly”
  买到 maidao “succeed in buying”
• Direction:
  跳过去 tiaoguoqu “jump across”
  走进来 zoujinlai “come walking in”
Subject-Verb compounds
头疼 tou-teng (head hurt) “have a headache”
嘴硬 zui-ying (mouth hard) “stubborn”
眼红 yan-hong (eye red) “covet”
心酸 xin-suan (heart sour) “feel sad”
命苦 ming-ku (fate bitter) “unlucky”
Subject-Verb compounds
我 的 头 很 疼
I DE head very hurt
“My head hurts badly.”
这 事 让 我 很 头疼
this matter make I very headache
“This gave me a real headache.”
Verb-object compounds
出版 chu-ban (emit edition) “publish”
睡觉 shui-jiao (sleep) “sleep”
毕业 bi-ye (finish study) “graduate”
开刀 kai-dao (operate knife) “operate”
开玩笑 kai-wanxiao (make joke) “make a joke”
照相 zhao-xiang (shine image) “take a picture”
Verb-object compounds
别 开玩笑 !
“Do not joke!”
开 他 的 玩笑。
make he DE joke
“Make fun of him.”
Let’s try one
她 很 担心 孩子 的 健康 成长
Test type | Test | Test result | Prediction
phonological | Bound morpheme | Yes? | One word
phonological | Syllable count | yes | One word
syntactic | Insertion | no | One word
syntactic | XP substitution | no | One word
semantic | Non-compositional | yes | One word
others | Productive | N/A |
others | Frequency | N/A |
But …
• 担心: 她 为 孩子 担 心
Test type | Test | Test result | Prediction
phonological | Bound morphemes? | no | Two words
phonological | Syllable count | yes | One word
syntactic | Insertion | yes | Two words
syntactic | XP substitution | yes | Two words
semantic | Non-compositional? | yes | One word
others | Productive? | N/A |
others | Frequent co-occurrence? | N/A |
Summary
• Wordhood has to be decided in context
• When wordhood tests lead to conflicting predictions, decisions will have to be made based on what the annotated corpus is for.
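The conflict described in the summary can be made explicit by tallying what each test predicts. The sketch below encodes the outcomes from the second 担心 example (the split use 担 … 心) as plain data; the structure and the name `summarize` are assumptions for illustration, and the tally is not meant as a decision procedure.

```python
from collections import Counter

# (test, prediction) pairs for the split use of 担心; None means not applicable.
SPLIT_DANXIN = [
    ("bound morphemes", "two words"),
    ("syllable count", "one word"),
    ("insertion", "two words"),
    ("XP substitution", "two words"),
    ("non-compositional", "one word"),
    ("productive", None),
    ("frequent co-occurrence", None),
]

def summarize(predictions):
    """Tally the predictions and report whether the tests disagree."""
    counts = Counter(p for _, p in predictions if p is not None)
    return counts, len(counts) > 1

counts, conflict = summarize(SPLIT_DANXIN)
print(dict(counts), "conflict:", conflict)
```

A simple majority vote would be one way to resolve the conflict, but as the summary says, the choice really depends on what the annotated corpus is for.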
Discussion question
• Based on the wordhood criteria we have discussed, is “make headway” one word or two words?
POS-tagging: throwing words into different buckets…
• Each category is a bucket
• How many buckets are there?
  ◦ Noun
  ◦ Verb
  ◦ Adjective
  ◦ Preposition
  ◦ Adverb
• Which bucket should “five”, “the”, and “$” go into?
Penn Treebank Tagsets (buckets)
• CC - coordinating conjunction: and, but
• CD - cardinal number: one, two, three
• DT - determiner: a, the, this, that
• EX - existential there
• FW - foreign word
• IN - preposition or subordinate conjunction
• LS - list marker: firstly, secondly
• TO - to
• UH - interjection: uh, oh
CC or DT
• Neither/?? he nor/CC she likes skiing.
• Neither/?? man likes skiing.
• Either/?? Jean or/CC Mary likes singing.
• Either/?? girl likes singing.
• Both/?? Jack and/CC Tom hate singing.
• Both/?? men hate singing.
CC or DT
• Neither/CC he nor/CC she likes skiing.
• Neither/DT man likes skiing.
• Either/CC Jean or/CC Mary likes singing.
• Either/DT girl likes singing.
• Both/CC Jack and/CC Tom hate singing.
• Both/DT men hate singing.
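The contrast above suggests a simple context cue: "either", "neither" and "both" behave as CC when a matching conjunction follows later in the sentence, and as DT otherwise. The sketch below encodes only that cue. It is a toy heuristic rather than the Penn Treebank guideline itself, and the names are hypothetical.

```python
# Which conjunctions can pair with each word in a correlative construction.
CORRELATIVE_PARTNER = {
    "either": {"or"},
    "neither": {"nor", "or"},
    "both": {"and"},
}

def tag_correlative(tokens, i):
    """Guess CC vs DT for tokens[i], which must be either/neither/both."""
    partners = CORRELATIVE_PARTNER.get(tokens[i].lower())
    if partners is None:
        raise ValueError(f"not a correlative candidate: {tokens[i]!r}")
    following = {t.lower() for t in tokens[i + 1:]}
    return "CC" if following & partners else "DT"

print(tag_correlative("Either Jean or Mary likes singing .".split(), 0))
print(tag_correlative("Either girl likes singing .".split(), 0))
```

The heuristic overgenerates: in "Both men sing and dance" it would wrongly return CC, because it does not check what the later "and" actually coordinates. Getting this right needs syntactic structure, not just the word list.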
CD or NN
• One/?? of the best reasons
• The only one/?? of its kind
• The only ones/?? of its kind
CD or NN
• One/CD of the best reasons
• The only one/NN of its kind
• The only ones/NNS of its kind
EX or RB
• There/?? was a party in progress.
• There/?? ensued a melee.
• There/?? , a party was in progress.
• There/?? , ensued a melee.
EX or RB
• There/EX was a party in progress.
• There/EX ensued a melee.
• There/RB , a party was in progress.
• There/RB , ensued a melee.
The role of context in POS tagging
• Can we take a list of all the words in a language and decide which bucket each word should go into, without looking at the context in which the word occurs?
• Examples: water, can, drops
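A context-free lookup shows why the answer is no: many words belong to more than one bucket. The miniature lexicon below is hand-made for illustration (it is not derived from the Penn Treebank) and uses Penn Treebank tag names.

```python
# Hypothetical mini-lexicon: each word maps to the tags it can take.
LEXICON = {
    "water": {"NN", "VB"},      # "the water" / "to water the plant"
    "can": {"MD", "NN", "VB"},  # "can swim" / "a can" / "to can tomatoes"
    "drops": {"NNS", "VBZ"},    # "two drops" / "it drops"
    "the": {"DT"},
}

def ambiguous_words(lexicon):
    """Return the words that a context-free lookup cannot tag uniquely."""
    return sorted(word for word, tags in lexicon.items() if len(tags) > 1)

print(ambiguous_words(LEXICON))
```

Only "the" can be tagged from the list alone; for the other three, a tagger has to look at the surrounding words.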
Categorizing context
• Morphological
• Syntactic
• Semantic
Morphological context
• Inflectional morphology
  ◦ Verb: destroy, destroying, destroyed
  ◦ Noun: destruction, destructions
  ◦ He watered the plant.
• Derivational morphology
  ◦ Noun: destruction
Syntactic context
• Verb: The bomb destroyed the building. He decided to water the plant.
• Noun: The destruction of the building
Semantic context
• Verb: action, activity
• Noun: state, object, etc.
What do we have in Chinese?
• Morphological clues: not as much
• Syntactic clues: not as rich, but exist
• Semantic clues: about the same
Syntactic clues
• Impoverished, but exist:
这 座 大楼 的 倒塌
this CL building DE collapse
“the collapse of this building”
这 座 大楼 看起来 要 倒塌
this CL building seem will collapse
“It looks like this building will collapse.”
Semantic clues
• Same as English:
  ◦ Noun: state, object, etc.
  ◦ Verb: action, activity, etc.
When syntactic and semantic clues are in conflict
这 座 大楼 的 倒塌
this CL building DE collapse
“the collapse of this building”
• Option 1: 倒塌 is a verb regardless of its context
• Option 2: 倒塌 can be a noun or a verb depending on its context
• The Chinese Treebank decision: option 2
  ◦ POS tags based on syntactic clues encode not only a word’s own lexical properties, but also information provided by its context
  ◦ “Context-free” POS tags are no better than a dictionary
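Option 2 amounts to letting the syntactic frame decide the tag. A one-rule sketch of the idea: a word such as 倒塌 is tagged as a noun when it directly follows 的 (DE) and as a verb otherwise. The rule, the function name, and the use of the labels NN and VV are simplifying assumptions; the actual Chinese Treebank guidelines rely on a richer set of tests.

```python
def tag_by_context(tokens, i):
    """Tag tokens[i] as NN after 的, otherwise as VV (toy rule)."""
    if i > 0 and tokens[i - 1] == "的":
        return "NN"
    return "VV"

nominal = ["这", "座", "大楼", "的", "倒塌"]
verbal = ["这", "座", "大楼", "看起来", "要", "倒塌"]

print(tag_by_context(nominal, 4))
print(tag_by_context(verbal, 5))
```

The same word form receives two different tags, so the tag carries information about the construction the word appears in, which a dictionary entry cannot supply.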
Online references
• Chinese Treebank: www.cis.upenn.edu/~chinese
• Sproat, Richard. 2002. Coling tutorial: www.linguistics.uiuc.edu/rws
• Penn Treebank: www.cis.upenn.edu/~treebank/home.html