Verb Valency Frame Extraction Using Morphological and Syntactic Features of Croatian
Krešimir Šojat, Željko Agić, Marko Tadić
Department of Linguistics, Department of Information Sciences
Faculty of Humanities and Social Sciences, University of Zagreb
{ksojat, zagic, marko.tadic}@ffzg.hr
FASSBL 7 Conference, Dubrovnik, Croatia, 2010-10-05
Overview § What? § extraction and semi-automatic construction of verb valency frames § How? § rule-based extraction procedure run on the Croatian dependency treebank § manual assignment of tectogrammatical functors § inference of rules for assigning functors to unseen text § Why? § creation of treebank-based verb valency lexicon § enhancement and enrichment of existing resources
Valency frames § valency frame extraction means detecting all environments of a particular verb attested in the treebank § this approach aims at fast construction of valency frames § extraction is automatic; human annotators add no frame elements manually § the automatically acquired verb valency lexicon can serve as a basis for enriching and enhancing manually constructed resources, whether existing or built from scratch
The treebank § Croatian Dependency Treebank (HOBS) § follows the guidelines of the Prague Dependency Treebank § taken from the Croatia Weekly 100 kw sub-corpus of the Croatian National Corpus (HNK) § XCES-encoded up to the word level § sentence-delimited, tokenized, manually lemmatized and MSD-tagged § serves as the morphological layer of the treebank § annotated on the syntactic layer § approximately 2,700 sentences, 67,000 tokens § manually assigned syntactic functions § ca. 1,300 sentences double-checked and used in this experiment
The treebank § HR: Unija je već dogovorila neke mjere kako bi pomogla Hrvatskoj. § EN: The Union has already arranged some measures in order to help Croatia.
Extraction algorithm § the algorithm aims to extract verb valency frame instances § for each verb in the treebank sample, it descends § one level down the dependency tree to retrieve subjects (Sb), objects (Obj), adverbials (Adv) and nominal predicates (Pnom) § two levels down to retrieve tokens of the same kinds when they are introduced by subordinating conjunctions (Aux.C) or prepositions (Aux.P)
Extraction algorithm § algorithm illustration
dogovorila (dogovoriti Pred) [Unija Ncfsn Sb] [mjere Ncfpa Obj] [već Rt Adv] [kako Css Aux.C]
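The descending rules can be sketched in Python on the example sentence. This is a minimal sketch under stated assumptions, not the authors' implementation: the `Node` class, the function names, the tree shape, the `Aux.V` and `Atr` labels, and every MSD tag not shown on the slides (for instance `Vmps-sfa`, `Vcr3s`, `Npfsd`) are illustrative. Skipping `Aux.V` nodes is likewise an assumption.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple

Slot = Tuple[str, str, str]  # (form, MSD, analytical function)


@dataclass
class Node:
    """Hypothetical dependency-tree node; not the treebank's real format."""
    form: str
    lemma: str
    msd: str
    afun: str
    children: List["Node"] = field(default_factory=list)


TARGETS = {"Sb", "Obj", "Adv", "Pnom"}   # retrieved one level down
LINKS = {"Aux.C", "Aux.P"}               # conjunctions and prepositions


def extract_frame(verb: Node) -> List[Slot]:
    """Collect frame slots one level down, or two levels via Aux.C/Aux.P."""
    slots: List[Slot] = []
    for child in verb.children:
        if child.afun in TARGETS:
            slots.append((child.form, child.msd, child.afun))
        elif child.afun in LINKS:
            for grandchild in child.children:
                if grandchild.afun in TARGETS:
                    slots.append((
                        f"{child.form}->{grandchild.form}",
                        f"{child.msd}->{grandchild.msd}",
                        f"{child.afun}->{grandchild.afun}",
                    ))
    return slots


def walk(node: Node) -> Iterator[Node]:
    """Pre-order traversal of the dependency tree."""
    yield node
    for child in node.children:
        yield from walk(child)


def extract_all(root: Node) -> List[Tuple[str, List[Slot]]]:
    """One frame per verb, whatever its analytical function (Aux.V skipped by assumption)."""
    return [(n.lemma, extract_frame(n)) for n in walk(root)
            if n.msd.startswith("V") and n.afun != "Aux.V"]


# "Unija je već dogovorila neke mjere kako bi pomogla Hrvatskoj."
root = Node("dogovorila", "dogovoriti", "Vmps-sfa", "Pred", [
    Node("Unija", "Unija", "Ncfsn", "Sb"),
    Node("je", "biti", "Vcr3s", "Aux.V"),
    Node("već", "već", "Rt", "Adv"),
    Node("mjere", "mjera", "Ncfpa", "Obj",
         [Node("neke", "neki", "Pi-fpa--n-a--", "Atr")]),
    Node("kako", "kako", "Css", "Aux.C", [
        Node("pomogla", "pomoći", "Vmps-sfa", "Adv", [
            Node("bi", "biti", "Vca3s", "Aux.V"),
            Node("Hrvatskoj", "Hrvatska", "Npfsd", "Obj"),
        ]),
    ]),
])

frames = extract_all(root)
for lemma, slots in frames:
    print(lemma, slots)
```

Under these assumptions the sentence yields two frames, one for dogovoriti and one for pomoći, and the conjunction-introduced clause appears in the first frame as a chained `kako->pomogla` slot.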
Extraction algorithm § the first version retrieved predicates only; it was then expanded to retrieve all verbs in the treebank sample § the algorithm now retrieves any verb found in the dependency structure, regardless of its analytical function and position in the dependency tree § the adaptation raises the recall of the algorithm while maintaining its precision, since the simple set of descending rules is unchanged § i.e. it retrieves as many verbs as possible given the limited size of the treebank sample used in the experiment
Extraction algorithm § the verb “imati” (Vmn) is annotated as object (Obj)
Extraction algorithm § thus, the number of frames extracted from each sentence corresponds to the number of verbs in it: § one frame for the main clause that captures the whole syntactic structure of the sentence § frames extracted from dependent clauses
naglasio (naglasiti Vmps-sma Pred) [Mikuška Np-sn Sb] [kako->imati Css Aux.C->Obj]
imati (imati Vmn Obj) [stanovništvo Ncnsn Sb] [korist Ncfsa Obj] [od->projekta Spsg->Ncmsg Aux.P->Adv] [kroz->ekoturizam Spsa->Ncmsa Aux.P->Adv]
Functor assignment § to annotate verb frames we used a set of 5 argument functors and 32 free-modification functors: § Argument functors: ACT, PAT, ADDR, ORIG, EFF § Temporal functors: TWHEN, TFHL, TFRWH, THL, THO, TOWH, TPAR, TSIN, TTILL § Locative and directional functors: DIR1, DIR2, DIR3, LOC § Functors for causal relations: AIM, CAUS, CNCS, COND, INTT § Functors for expressing manner: ACMP, CPR, CRIT, DIFF, EXT, MANN, MEANS, REG, RESL, RESTR § Functors for specific modifications: BEN, CONTRD, HER, SUBS § 936 frame instances were manually annotated for 424 different verbs
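The functor inventory above can be written down as plain data, which makes it easy to check manually annotated frames for labels outside the set. The sketch below is illustrative only: the variable names and the `unknown_functors` helper are hypothetical, and `DIR1` to `DIR3` are written without spaces.

```python
# Functor inventory as listed on the slide; names of the containers are hypothetical.
ARGUMENT_FUNCTORS = {"ACT", "PAT", "ADDR", "ORIG", "EFF"}

FREE_MODIFICATIONS = {
    "temporal": {"TWHEN", "TFHL", "TFRWH", "THL", "THO",
                 "TOWH", "TPAR", "TSIN", "TTILL"},
    "locative_directional": {"DIR1", "DIR2", "DIR3", "LOC"},
    "causal": {"AIM", "CAUS", "CNCS", "COND", "INTT"},
    "manner": {"ACMP", "CPR", "CRIT", "DIFF", "EXT",
               "MANN", "MEANS", "REG", "RESL", "RESTR"},
    "specific": {"BEN", "CONTRD", "HER", "SUBS"},
}

ALL_FREE = set().union(*FREE_MODIFICATIONS.values())
ALL_FUNCTORS = ARGUMENT_FUNCTORS | ALL_FREE


def unknown_functors(frame: str) -> list:
    """Return the labels in a space-separated frame that are not in the inventory."""
    return [label for label in frame.split() if label not in ALL_FUNCTORS]


print(len(ARGUMENT_FUNCTORS), len(ALL_FREE))
print(unknown_functors("ACT ADDR AIM PAT"))
```

A check of this kind could be run over annotated frames before counting them, so that misspelled labels do not end up as separate frame types.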
Results § valency frame frequency across verb lemmas
Verb: biti imati reći dobiti raditi kazati postati vidjeti dati
Frequency: 188 23 15 12 10 9 8 8 8 7

raditi (en. to work, to do)
Valency frame           Frequency
ACT PAT                 2
ACT CRIT LOC THL        1
ACT MANN TWHEN          1
ACT MEANS TWHEN         1
ACT PAT TSIN            1

dati (en. to give)
Valency frame           Frequency
ACT ADDR PAT            4
ACT ADDDR PAT           1
ACT ADDR AIM PAT        1
ACT PAT                 1
Results § frequency of verb valency frames, i.e. n-tuples of tectogrammatical functors
Frame: ACT PAT* ACT PAT TWHEN ACT MANN PAT ACT ADDR PAT ACT LOC PAT MANN PAT ACT CAUS PAT ACT MANN LOC PAT ADDR PAT
Count: 250 157 30 23 20 20 20 17 16 13 12 11
Percent: 26.71 16.77 3.21 2.46 2.14 1.82 1.71 1.39 1.28 1.18
Other: 347 (37.07)
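Counting frames as n-tuples of functors can be sketched as follows. The function names are hypothetical, and sorting the functors inside each frame is an assumption made so that the order of dependents in a sentence does not create distinct frame types. The sample data are the raditi instances from the per-verb table.

```python
from collections import Counter


def frame_key(functors):
    """Normalise a frame instance to an order-independent tuple of functors."""
    return tuple(sorted(functors))


def frame_distribution(instances):
    """Return (frame, count, percent) rows, most frequent frame first."""
    counts = Counter(frame_key(f) for f in instances)
    total = sum(counts.values())
    return [(" ".join(key), n, round(100 * n / total, 2))
            for key, n in counts.most_common()]


# Frame instances for "raditi", one list of functors per instance.
raditi_instances = [
    ["ACT", "PAT"],
    ["PAT", "ACT"],
    ["ACT", "CRIT", "LOC", "THL"],
    ["ACT", "MANN", "TWHEN"],
    ["ACT", "MEANS", "TWHEN"],
    ["ACT", "PAT", "TSIN"],
]

distribution = frame_distribution(raditi_instances)
for frame, count, percent in distribution:
    print(f"{frame:<20}{count:>4}{percent:>8.2f}")
```

The percentages on the slide are consistent with dividing each count by the 936 annotated instances, for example 250 / 936 ≈ 26.71 % and 347 / 936 ≈ 37.07 %.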
Results § frames annotated with MSD, analytical functions and tectogrammatical functors
djelovati (djeluje Pred) [neozbiljno Neozbiljno Rnp Adv MANN] [odustajanje Ncnsn Sb ACT]
osloboditi (oslobodili Pred) [ACT] [nikada Nikada Rt Adv THL] [zloduha Ncmsg Obj PAT]
postati (postali Pred) [studiji Ncmpn Sb ACT] [fakultet Ncmsn Obj PAT]
zaustaviti (zaustavio Atr) [ACT] [oni ih Pp3-pa--y-n-- Obj PAT] [dolina u->dolini Spsl->Ncfsl Aux.P->Adv LOC]
Results § distribution of (MSD, analytical function) pairs across tectogrammatical functors § serves as a basis for defining functor assignment rules from MSD and analytical function

ACT (Actor)
A-fun          MSD               %
Sb             Ncmsn             14.91
Sb             Np-sn             13.50
Sb             Ncfsn             12.87
Sb             Ncmpn             9.89
Sb             Npfsn             5.65
Sb             Pi-mpn--n-a--     4.71
Sb             Ncfpn             3.30
Sb             Ncnsn             2.98
Sb             Pi-msn--n-a--     2.51
Sb             Pi-fsn--n-a--     1.88

PAT (Patient)
A-fun          MSD               %
Obj            Ncfsa             11.25
Obj            Ncmsa             9.18
Pnom           Ncmsn             5.69
Obj            Ncmpa             4.53
Obj            Vmn*              4.40
Obj            Ncnsa             3.75
Obj            Ncfpa             3.49
Pnom           Ncfsn             2.72
(Aux.C) Obj    (Css) Vmip3s      2.07
Obj            Ncmsn             1.81

LOC (Locative)
A-fun          MSD               %
(Aux.P) Adv    (Spsl) Ncfsl      21.88
(Aux.P) Adv    (Spsl) Ncmsl      16.41
(Aux.P) Adv    (Spsl) Npmsl      10.16
(Aux.P) Adv    (Spsl) Ncnsl      8.59
(Aux.P) Adv    (Spsl) Npfsl      8.59
(Aux.P) Adv    (Spsl) Ncmpl      5.47
(Aux.P) Adv    (Spsl) Ncfpl      3.91
               Rl                3.13
Adv            Css               1.56
(Aux.P) Adv    (Spsg) Ncmsg      1.56
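One way such a distribution could be turned into assignment rules is a majority-vote lookup from (analytical function, MSD) pairs to functors. This is only a sketch of the idea: the slides do not specify how the rules are defined, the function names are hypothetical, and the training triples are taken from the annotated frame examples shown in the results.

```python
from collections import Counter, defaultdict

# (analytical function, MSD, functor) triples from the annotated example frames.
annotated_slots = [
    ("Adv", "Rnp", "MANN"),
    ("Sb", "Ncnsn", "ACT"),
    ("Adv", "Rt", "THL"),
    ("Obj", "Ncmsg", "PAT"),
    ("Sb", "Ncmpn", "ACT"),
    ("Obj", "Ncmsn", "PAT"),
    ("Obj", "Pp3-pa--y-n--", "PAT"),
    ("Aux.P->Adv", "Spsl->Ncfsl", "LOC"),
]


def learn_rules(slots):
    """Map each (A-fun, MSD) pair to its most frequent functor (assumed strategy)."""
    votes = defaultdict(Counter)
    for afun, msd, functor in slots:
        votes[(afun, msd)][functor] += 1
    return {pair: counter.most_common(1)[0][0] for pair, counter in votes.items()}


def assign_functor(afun, msd, rules):
    """Look up a functor for an unseen slot; None when no rule covers the pair."""
    return rules.get((afun, msd))


rules = learn_rules(annotated_slots)
print(assign_functor("Aux.P->Adv", "Spsl->Ncfsl", rules))
print(assign_functor("Sb", "Ncfsn", rules))
```

With only a handful of training slots most unseen pairs stay uncovered, so a practical rule set would presumably need back-off, for example to the analytical function alone or to the case encoded in the MSD.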
Conclusions § in this experiment we have designed and implemented one possible approach: § to the semi-automatic extraction of a valency frame lexicon for Croatian verbs § to the refinement of existing lexicons by using the Croatian Dependency Treebank as an underlying resource § we have automatically extracted 2930 verb valency frame instances and manually annotated 936 frames, which yielded: § the distribution of valency frames for each of the encountered verbs § the distribution of analytical functions and morphosyntactic tags for each of the tectogrammatical functors
Future work § the first result enables the enrichment of existing valency lexicons, such as CROVALLEX § the second result enables the implementation of a rule-based system for automatic assignment of tectogrammatical functors to morphosyntactically tagged and dependency-parsed unseen text § this procedure for automatic detection of valency frames will also be used in several other projects dealing with factored SMT (e.g. ACCURAT) § regarding dependency parsing of Croatian with the Croatian Dependency Treebank, we shall pursue several research directions in order to increase overall parsing accuracy
Thank you for your attention. www.accurat-project.eu The research within the project ACCURAT leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013), grant agreement no. 248347.