A Semantic Database for Icelandic Language Technology (Merkingarbrunnur fyrir íslenska máltækni)
Matthew Whelpton and Anna Björk Nikulásdóttir
Hugvísindaþing 2009
Semantics in IT and NLP
- Semantic information is an essential resource for Natural Language Processing applications, especially for:
  - information extraction
  - question answering
  - machine translation
  - integrated spoken language systems
First steps for Icelandic
- Three examples of semantic resources for English
  - WordNet
  - FrameNet
  - PropBank
- Overview of the first project aiming to develop a comparable resource for Icelandic
  - WordNet-like semantic network
  - RANNÍS
WordNet
- WordNet is
  - an electronic database of (English) words and the semantic relations between them
Hyponymy
- “fox” is_a_kind_of “animal”
  - every fox is an animal
  - NOT every animal is a fox
- “fox” is a hyponym of “animal”
  - hypo ‘under’
  - nym ‘name’
- “animal” is a hypernym of “fox”
  - hyper ‘over’
  - nym ‘name’
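The asymmetry of is_a_kind_of can be sketched in a few lines of Python. This is only an illustration: the dictionary `HYPERNYM` and the function `is_a_kind_of` are hypothetical names, and the single entry is the example from the slide, not data taken from WordNet itself.

```python
# Toy hypernym table holding only the slide's example (assumed structure).
HYPERNYM = {"fox": "animal"}

def is_a_kind_of(word, ancestor):
    """Return True if `ancestor` can be reached from `word` by following hypernym links."""
    current = HYPERNYM.get(word)
    while current is not None:
        if current == ancestor:
            return True
        current = HYPERNYM.get(current)
    return False

print(is_a_kind_of("fox", "animal"))  # True: every fox is an animal
print(is_a_kind_of("animal", "fox"))  # False: not every animal is a fox
```

Following the links in one direction only is what makes the relation asymmetric.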
Two simple uses
- Helps with semantically sensitive searches
  - find animals that [...] eggs
  - WordNet tells us that “fox” as an “animal” meets the first criterion of this search
- Helps with reference tracking
  - The farmer saw a fox. The animal was hurt.
    - => the fox was hurt
Synonymy
- At the heart of WordNet is the notion of a synset
  - a synset = a set of synonyms
  - synonym
    - having the same meaning
    - syn ‘same’; nym ‘name’
  - in WordNet words count as synonyms if they can be substituted for each other in some contexts
Synsets
- WordNet lists “animal” as synonymous with “creature” and “beast”.
  - The farmer saw a fox. The animal/creature/beast was hurt.
    - => the fox was hurt
- WordNet lists “farmer” as a hyponym of “person” not “animal”.
- “person” and “animal” are hyponyms of “being”
Fragment of a WordNet
being
  person
    farmer
  {animal; creature; beast}
    fox
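The fragment and the reference-tracking example can be put together in a small sketch. Modelling synsets as frozensets and the names `HYPERNYM`, `synset_of` and `resolve` are assumptions made for illustration; real WordNet data and real anaphora resolution are considerably richer.

```python
# Synsets as frozensets of words (an assumed representation).
BEING = frozenset({"being"})
PERSON = frozenset({"person"})
ANIMAL = frozenset({"animal", "creature", "beast"})
FARMER = frozenset({"farmer"})
FOX = frozenset({"fox"})
SYNSETS = [BEING, PERSON, ANIMAL, FARMER, FOX]

# Hypernym links between synsets, as in the fragment.
HYPERNYM = {PERSON: BEING, ANIMAL: BEING, FARMER: PERSON, FOX: ANIMAL}

def synset_of(word):
    """Return the synset containing `word`, or None if the word is unknown."""
    for synset in SYNSETS:
        if word in synset:
            return synset
    return None

def is_hyponym_of(synset, ancestor):
    """Follow hypernym links upwards from `synset` looking for `ancestor`."""
    current = HYPERNYM.get(synset)
    while current is not None:
        if current == ancestor:
            return True
        current = HYPERNYM.get(current)
    return False

def resolve(definite_noun, candidates):
    """Keep the candidate antecedents that are hyponyms of the noun's synset."""
    target = synset_of(definite_noun)
    return [c for c in candidates if is_hyponym_of(synset_of(c), target)]

# "The farmer saw a fox. The beast was hurt."
print(resolve("beast", ["farmer", "fox"]))  # ['fox']
```

Because “animal”, “creature” and “beast” share one synset, any of the three picks out the fox and none picks out the farmer.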
FrameNet
- WordNet focuses on individual words and semantic relations between them.
- FrameNet focuses on generalised situations (frames) and the entities and activities which characterise those situations (frame elements)
Apply_heat Frame
- Frame
  - Apply_heat
- Frame Elements
  - Cook
  - Food
  - Heating_instrument
- Example
  - [Cook Matilde] [Apply_heat fried] [Food the catfish] [Heating_instrument in a heavy iron skillet].
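One way to hold such an annotation in code is as a list of labelled spans. The sketch below is an assumed representation (the `Span` class and both helper functions are hypothetical), not FrameNet's own data format.

```python
from dataclasses import dataclass

@dataclass
class Span:
    label: str  # frame element name, or the frame name for the target word
    text: str

# The example sentence from the slide.
annotation = [
    Span("Cook", "Matilde"),
    Span("Apply_heat", "fried"),
    Span("Food", "the catfish"),
    Span("Heating_instrument", "in a heavy iron skillet"),
]

def to_bracketed(spans):
    """Render the spans in the bracketed notation used on the slide."""
    return " ".join(f"[{s.label} {s.text}]" for s in spans) + "."

def to_plain(spans):
    """Recover the unannotated sentence."""
    return " ".join(s.text for s in spans) + "."

print(to_bracketed(annotation))
print(to_plain(annotation))  # Matilde fried the catfish in a heavy iron skillet.
```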
Form and Meaning
- WordNet
  - Individual words
  - Semantic relations between them
- FrameNet
  - General situations (world knowledge)
  - How words characterise those situations
- PropBank
  - Relation between
    - Syntactic Structure
    - Predicate Argument Structure
  - U. Penn. TreeBank + predicate argument information
Syntactic and Argument Structure
- What do you like?
Treebank annotation:
  (SBARQ (WHNP-1 what) (SQ do (NP-SBJ you) (VP like (NP *T*-1))))
PropBank annotation:
  Rel: like
  Arg0: you
  Arg1: [*T*] -> What
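The link from the trace *T*-1 back to “what” can be recovered mechanically from the bracketed tree. The sketch below is a minimal illustration: `parse`, `fillers` and `resolve_trace` are hypothetical helpers, and it handles only co-indexed *T* traces of the kind shown here, not the full Treebank or PropBank annotation scheme.

```python
import re

TREE = "(SBARQ (WHNP-1 what) (SQ do (NP-SBJ you) (VP like (NP *T*-1))))"

def parse(text):
    """Parse a bracketed tree into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return node()

def words(tree):
    """Collect the leaf tokens under a node, left to right."""
    out = []
    for child in tree[1]:
        out.extend(words(child) if isinstance(child, tuple) else [child])
    return out

def fillers(tree, found=None):
    """Map each index N on a label such as WHNP-N to the words under that node."""
    found = {} if found is None else found
    match = re.search(r"-(\d+)$", tree[0])
    if match:
        found[match.group(1)] = " ".join(words(tree))
    for child in tree[1]:
        if isinstance(child, tuple):
            fillers(child, found)
    return found

def resolve_trace(token, tree):
    """Replace a trace like *T*-1 with its co-indexed filler; return other tokens unchanged."""
    match = re.fullmatch(r"\*T\*-(\d+)", token)
    return fillers(tree)[match.group(1)] if match else token

tree = parse(TREE)
print(resolve_trace("*T*-1", tree))  # what
```

With the trace resolved, the arguments of “like” can be read off as Arg0 “you” and Arg1 “what”.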
The Next Step (RANNÍS)
- All of the resources mentioned were manually created
  - extremely labour-intensive
  - unrealistic for Icelandic
- Search for automated methods
- Anna Nikulásdóttir has already begun such work on the creation of a WordNet-like semantic database for Icelandic and this will be the focus of the new RANNÍS project.
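Extracting semantic relations from ÍO
- Nine different semantic relations are extracted automatically from Íslensk orðabók (ÍO, the Icelandic dictionary)
- The method uses part-of-speech patterns and mixed word and part-of-speech patterns (l = adjective, n = noun, s = verb; fletta = headword):
  - l_n: yfirheiti(fletta, n) [hypernym], eiginleiki(fletta, l) [attribute]
  - það að_s: tengtSo(fletta, s) [related verb]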
Extracting semantic relations from ÍO
- fuglsblundur: örstuttur (adj.) svefn (noun) [‘very short sleep’]
  - yfirheiti(fuglsblundur, svefn)
  - eiginleiki(fuglsblundur, örstuttur)
- mæling: það að mæla (verb) [‘the act of measuring’]
  - tengtSo(mæling, mæla)
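The two patterns shown can be sketched as a small rule-based extractor over part-of-speech-tagged definitions. This is an illustration only: the function name `extract_relations`, the (word, tag) input format and the placeholder tag "-" are assumptions, and the actual system covers nine relations with many more patterns.

```python
def extract_relations(headword, definition):
    """Apply the two slide patterns to a tagged definition.

    `definition` is a list of (word, tag) pairs; tags 'l', 'n', 's' follow
    the slide, and '-' is a placeholder for any other tag (assumed).
    """
    words = [w for w, _ in definition]
    tags = [t for _, t in definition]
    # Pattern l_n: an adjective followed by a noun.
    if tags == ["l", "n"]:
        return [("yfirheiti", headword, words[1]),
                ("eiginleiki", headword, words[0])]
    # Pattern "það að_s": the literal words "það að" followed by a verb.
    if words[:2] == ["það", "að"] and tags[2:] == ["s"]:
        return [("tengtSo", headword, words[2])]
    return []

print(extract_relations("fuglsblundur", [("örstuttur", "l"), ("svefn", "n")]))
print(extract_relations("mæling", [("það", "-"), ("að", "-"), ("mæla", "s")]))
```

Definitions that match neither pattern yield no relations.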
Extracting semantic relations from ÍO
- Results:
  - 77,348 definitions, or 96.45% of all noun definitions in ÍO, were analysed
  - Test set: 94.77% of definitions had no incorrect analysis
Learning semantics from text
- Learning semantics from text requires a huge amount of tagged text
- The Tagged Icelandic Corpus (Mörkuð íslensk málheild) will be of great use here
- The aim is to find patterns in Icelandic comparable to the so-called Hearst patterns in English
Hearst patterns
- NP0 such as NP1, NP2, ..., NPn-1 (and|or) NPn
  - ... red algae, such as Gelidium, ...
- such NP0 as NP1, NP2, ..., NPn-1 (and|or) NPn
  - ... works by such authors as Herrick, Goldsmith, and Shakespeare
- NP1, NP2, ..., NPn (and|or) other NP0
  - Bruises, wounds, broken bones or other injuries ...
  - ... temples, treasuries, and other important civic buildings
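The first pattern can be sketched as a matcher over text that has already been chunked into noun phrases. The input format (a list mixing ("NP", text) tuples with plain token strings) and the function names are assumptions for illustration; producing the chunks requires a tagger and chunker that are not shown.

```python
def is_np(item):
    """True for a chunked noun phrase, represented as ("NP", text)."""
    return isinstance(item, tuple) and item[0] == "NP"

def hearst_such_as(chunks):
    """Find (hyponym, hypernym) pairs for 'NP0 such as NP1, ..., (and|or) NPn'."""
    pairs = []
    for i in range(len(chunks) - 1):
        if chunks[i] == "such" and chunks[i + 1] == "as":
            # NP0 stands just before "such as", optionally followed by a comma.
            j = i - 1
            if j >= 0 and chunks[j] == ",":
                j -= 1
            if j < 0 or not is_np(chunks[j]):
                continue
            hypernym = chunks[j][1]
            # Collect the NPs after "such as", skipping commas and conjunctions.
            k = i + 2
            while k < len(chunks):
                if is_np(chunks[k]):
                    pairs.append((chunks[k][1], hypernym))
                elif chunks[k] not in {",", "and", "or"}:
                    break
                k += 1
    return pairs

chunks = [("NP", "red algae"), ",", "such", "as", ("NP", "Gelidium"), ","]
print(hearst_such_as(chunks))  # [('Gelidium', 'red algae')]
```

Each extracted pair is a candidate hyponymy link of the kind WordNet stores.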
Possible patterns in Icelandic?
- NP0, utan NP1, NP2, ..., NPn-1 og NPn
  - ... heimilisstörf, utan eldamennsku ... [‘housework, apart from cooking’]
- NP1, NP2, ..., NPn eða (aðra|annan|annað|önnur) NP0
  - ... Diesel-gallabuxur eða aðra merkjavöru [‘Diesel jeans or other branded goods’]
- NP0 eins og NP1, NP2, ..., NPn-1 (og|eða) NPn
  - ... grunngildum eins og góðum samskiptum ... [‘core values such as good communication’]
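The second candidate pattern differs from the English “such as” pattern in that the hypernym comes last, so the hyponyms have to be collected leftwards. A minimal sketch, again assuming pre-chunked input in a hypothetical ("NP", text) format:

```python
# Inflected forms of "other" listed in the pattern.
OTHER = {"aðra", "annan", "annað", "önnur"}

def is_np(item):
    """True for a chunked noun phrase, represented as ("NP", text)."""
    return isinstance(item, tuple) and item[0] == "NP"

def eda_other(chunks):
    """Find (hyponym, hypernym) pairs for 'NP1, ..., NPn eða <other> NP0'."""
    pairs = []
    for i in range(len(chunks) - 2):
        if chunks[i] == "eða" and chunks[i + 1] in OTHER and is_np(chunks[i + 2]):
            hypernym = chunks[i + 2][1]
            # Walk leftwards over the comma-separated NPs before "eða".
            hyponyms = []
            k = i - 1
            while k >= 0:
                if is_np(chunks[k]):
                    hyponyms.append(chunks[k][1])
                elif chunks[k] != ",":
                    break
                k -= 1
            pairs.extend((h, hypernym) for h in reversed(hyponyms))
    return pairs

chunks = [("NP", "Diesel-gallabuxur"), "eða", "aðra", ("NP", "merkjavöru")]
print(eda_other(chunks))  # [('Diesel-gallabuxur', 'merkjavöru')]
```

The extracted strings are inflected forms as they appear in the text (here “merkjavöru”), so a lemmatisation step would be needed before storing them as dictionary headwords.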
A semantic database (Merkingarbrunnur)?
- The aim is to link the results to one another
- A tool with a graphical interface will be developed for correcting and adding to the results of the automatic analysis
- Possibilities for linking to WordNet will be explored