Comparative Analysis of Automatic Term and Collocation Extraction

Sanja Seljan, Bojana Dalbelo Bašić, Jan Šnajder, Davor Delač, Matija Šamec-Gjurin, Dina Crnec

FF & FER
Faculty of Humanities and Social Sciences, Department of Information Sciences
Faculty of Electrical Engineering and Computing

INFuture 2009: Digital Resources and Knowledge Sharing, 4-7 November 2009

Overview
I. Introduction
  – Reasons for extraction
II. Research
  – Resources & tools
  – Extracted lists
III. Evaluation
  – Precision, recall, F-measure
IV. Conclusion

I. Introduction
• Monolingual and multilingual resources
  – Helpful
  – Integrated
  – Require human intervention
• EU pre-accession activities
  – Speed up + consistency
• Used in further research and practice

• List:
  – Terms (Member State, European Union)
  – Collocations (adopt a/the resolution, decided as follows)
  – Multi-word units (depend on, well-being)
• Term extraction process:
  – Term extraction (term acquisition): identification
  – Term recognition: verification

II. Research
• Resources
  – 10 documents – legislation, Cro-Eng
• Tools
  – TermeX tool (FER) – list A
  – SDL MultiTerm Extract + NooJ (FF) – list B
• Reference list
  – Evaluation – reference list

Reference list
• 470 terms and collocations
• Exclude unigrams
• Balance between lexical coverage, adequacy, practicality
  – terms (NPs: 346/470)
  – collocations (VPs)

Reference list
• Contains:
  – Terms (acquiring company, applicant country)
  – Collocations (adopt a/the resolution, decided as follows, entry into force, having regard to)
  – Names and abbreviations (Economic and Monetary Union EMU, European Union EU)
  – Relevant embedded terms (crime prevention, crime prevention bodies, national crime prevention measures)

List B
• Language-independent, statistically based SDL MultiTerm Extract tool
  – Frequency threshold set to 4
  – Filtered by the list of stop-words -> 369 candidates
• Language-dependent NooJ tool
  – 36 local grammars -> 512 candidates
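
The statistical step described on this slide (a frequency threshold plus a stop-word filter) can be illustrated with a short sketch. This is not the algorithm of SDL MultiTerm Extract, whose internals the slides do not describe. The function names, the n-gram range of 2 to 4 words, and the rule of dropping candidates that start or end with a stop word are all assumptions made for illustration.

```python
from collections import Counter


def candidate_ngrams(tokens, n_min=2, n_max=4):
    """Yield every n-gram (as a tuple) with n_min <= n <= n_max."""
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def extract_candidates(tokens, stop_words, min_freq=4):
    """Keep n-grams seen at least min_freq times that neither start
    nor end with a stop word (the boundary rule is an assumption)."""
    counts = Counter(candidate_ngrams(tokens))
    return {
        gram: freq
        for gram, freq in counts.items()
        if freq >= min_freq
        and gram[0] not in stop_words
        and gram[-1] not in stop_words
    }


# Toy input, not data from the study.
tokens = ("the member state of the european union " * 4).split()
stop_words = {"the", "of"}
candidates = extract_candidates(tokens, stop_words)
```

Filtering only on the first and last word keeps collocations with an internal function word (such as "entry into force") eligible, at the cost of letting some noisy candidates through.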

List A
• TermeX
  – Lexical association measures (AMs)
  – 14 AMs (PMI, Dice, Chi-square, …)
  – Lemmatization
  – POS filtering
  – Frequency threshold set to ?

List A
• Extracted terms ranked by AM value
  – 1816 candidates
• AMs used:
  – 2-grams – PMI
  – 3-grams, 4-grams – heuristic extensions
• Noun phrases only
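
Ranking 2-gram candidates by PMI can be sketched as follows. The code uses the standard definition of pointwise mutual information with relative-frequency estimates. It is not TermeX's implementation, it omits the lemmatization and POS filtering mentioned in the slides, and the function name and the `min_freq` parameter are hypothetical.

```python
import math
from collections import Counter


def rank_bigrams_by_pmi(tokens, min_freq=1):
    """Return (bigram, pmi) pairs sorted by descending PMI.

    PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) ), with probabilities
    estimated as relative frequencies in the token sequence.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scored = []
    for (x, y), freq in bigrams.items():
        if freq < min_freq:
            continue  # skip pairs below the frequency threshold
        p_xy = freq / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scored.append(((x, y), math.log2(p_xy / (p_x * p_y))))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy input, not data from the study.
tokens = "member state and member state and european union".split()
ranking = rank_bigrams_by_pmi(tokens)
```

PMI tends to score rare pairs highly, which is one reason it is usually paired with a frequency threshold.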

Results
• Evaluation
  – F1-measure (precision, recall)
  – True positives calculated by taking into account inflection (suffix stripping)

                  List A   List B
No. of terms        1816      508
Valid terms          202      234
Precision (%)      11.56    47.37
Recall (%)         42.98    49.79
F1 (%)             18.22    48.55
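
How the measures in the table relate can be shown with a small sketch. Recall is the share of the 470 reference entries that were found, and F1 is the harmonic mean of precision and recall. The function names are illustrative, and the precision values are taken from the table as reported rather than recomputed.

```python
REFERENCE_SIZE = 470  # size of the reference list, as stated in the slides


def recall_percent(valid_terms, reference_size=REFERENCE_SIZE):
    """Recall (%) = valid extracted terms / reference list size."""
    return 100.0 * valid_terms / reference_size


def f1_percent(precision, recall):
    """F1 (%) = harmonic mean of precision (%) and recall (%)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values as reported in the results table.
reported = {
    "List A": {"valid": 202, "precision": 11.56, "recall": 42.98},
    "List B": {"valid": 234, "precision": 47.37, "recall": 49.79},
}

for name, row in reported.items():
    r = round(recall_percent(row["valid"]), 2)
    f = round(f1_percent(row["precision"], row["recall"]), 2)
    print(f"{name}: recall={r} F1={f}")
```

With the values in the table, these functions reproduce the reported recall and F1 figures for both lists.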

Results
• List A unsatisfactory
  – Low recall: verb phrases and terms of more than 4 words are missed
  – Low precision: ranked list, can be improved with a cut-off (true positives are ranked higher)
• List B modest
  – Can be improved with lemmatization, definition of upper/lower case, more detailed local grammars
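
The effect of a cut-off on a ranked list can be illustrated with a minimal sketch: if true positives cluster near the top, precision over the first k entries is higher than over the whole list. The function name and the toy data below are hypothetical and are not taken from the study.

```python
def precision_at_cutoff(ranked_candidates, reference, k):
    """Precision (%) over the top-k entries of a ranked candidate list."""
    top = ranked_candidates[:k]
    if not top:
        return 0.0
    hits = sum(1 for candidate in top if candidate in reference)
    return 100.0 * hits / len(top)


# Hypothetical toy data, not the lists from the study.
ranked = [
    "european union",
    "member state",
    "of the",
    "applicant country",
    "in a",
    "and the",
]
reference = {"european union", "member state", "applicant country"}

top2 = precision_at_cutoff(ranked, reference, 2)
full = precision_at_cutoff(ranked, reference, len(ranked))
```

A cut-off trades recall for precision, since valid terms ranked below k are discarded.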

Conclusion
• Comparison of two hybrid approaches to term extraction
• Human-created lists differ from extracted lists
  – human knowledge, experience and intuition
• Space for improvement
  – automatic extraction combined with human intervention

Thank you!
