Collocations and Terminology
Vasileios Hatzivassiloglou
University of Texas at Dallas

Collocations
• Frank Smadja, “Retrieving Collocations from Text”, Computational Linguistics, 1993
• Recurrent combinations of words that cooccur more often than chance, often with non-compositional meaning
• Technical and non-technical

Examples of collocations
• The Dow Jones average of industrials
• The Dow average
• The Dow industrials
• *The Jones industrials
• The Dow Jones industrial
• *The industrial Dow
• *The Dow industrial

Collocation properties
• Arbitrary (dialect dependent)
  – ride a bike, set the table
• Domain dependent
  – dry suit, wet suit
• Recurrent
• Cohesive
  – Part of a collocation primes for the rest

Applications
• Lexicography
• Grammatical restrictions (compare with/to but associate with)
• Generation
• Translation

Types of collocations
• Predicative relations
  – make a decision, hostile takeover
  – flexible (syntactic variability, intervening words)
• Rigid word groups
  – over the counter market
• Phrases with open slots
  – fluency in a domain

Issues in finding collocations
• Possibly more than two words
  – Need measure that extends beyond the binary case
• Possibly intervening words
• Possibly morphological and syntactic variation
• Semantic constraints (cf. doctors-dentists and doctors-hospitals)

Xtract stage one
• For a given word, find all collocates at positions -5 to +5
• Three criteria:
  – strength (normalized frequency); 95% rejection vs. expected 68% under normal distribution
  – position histogram must not be flat
  – select peak from histogram
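
The three stage-one criteria can be sketched in a few lines of Python. This is an illustrative sketch, not Smadja's implementation: the function name `stage_one`, the pre-tokenized input, and the threshold defaults `k0`, `u0`, and `k1` are assumptions introduced here, and the slide does not state the exact formulas. Strength is taken to be a z-score of the collocate's frequency, flatness is measured by the variance of the position histogram, and a peak is a position whose count lies at least `k1` standard deviations above the histogram mean.

```python
from collections import Counter, defaultdict
from statistics import mean, pstdev


def stage_one(sentences, target, window=5, k0=1.0, u0=10.0, k1=1.0):
    """Return {collocate: [peak offsets]} for collocates passing all three tests.

    sentences: list of token lists. k0, u0, k1 are assumed thresholds.
    """
    # hist[w][offset] = how often w occurs at that offset from the target
    hist = defaultdict(Counter)
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            for off in range(-window, window + 1):
                j = i + off
                if off != 0 and 0 <= j < len(tokens):
                    hist[tokens[j]][off] += 1
    if not hist:
        return {}

    freqs = {w: sum(h.values()) for w, h in hist.items()}
    f_mean = mean(freqs.values())
    f_std = pstdev(freqs.values())
    offsets = [o for o in range(-window, window + 1) if o != 0]

    result = {}
    for w, h in hist.items():
        # Criterion 1: strength (frequency normalized to a z-score)
        if f_std == 0 or (freqs[w] - f_mean) / f_std < k0:
            continue
        counts = [h[o] for o in offsets]
        p_mean = mean(counts)
        spread = sum((c - p_mean) ** 2 for c in counts) / len(counts)
        # Criterion 2: the position histogram must not be flat
        if spread < u0:
            continue
        # Criterion 3: keep only the peak positions of the histogram
        cutoff = p_mean + k1 * spread ** 0.5
        peaks = [o for o, c in zip(offsets, counts) if c >= cutoff]
        if peaks:
            result[w] = peaks
    return result
```

A call such as `stage_one(corpus, "make")` returns, for each surviving collocate, the relative positions at which it peaks; these word pairs and positions are what the later stages start from.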

Xtract stage two
• Start from word pairs
• Look at each position in between, to the left, and to the right
• Keep words that appear very often
• If that fails, keep parts of speech that satisfy this criterion
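
A rough sketch of the stage-two idea follows, under stated assumptions: the function name `stage_two`, the `span` of context examined, and the 0.75 dominance `threshold` are hypothetical choices, not values given on the slide. The part-of-speech fallback is omitted; a position with no dominant word simply becomes the wildcard `"*"`.

```python
from collections import Counter


def stage_two(sentences, w1, w2, offset, span=5, threshold=0.75):
    """Grow the pair (w1, w2), found `offset` tokens apart, into a word template."""
    # Collect every occurrence of the pair at the given relative distance
    contexts = []
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            j = i + offset
            if tok == w1 and 0 <= j < len(tokens) and tokens[j] == w2:
                contexts.append((tokens, i))
    if not contexts:
        return []

    # Examine each position to the left, in between, and to the right
    lo = min(0, offset) - span
    hi = max(0, offset) + span
    template = []
    for rel in range(lo, hi + 1):
        counts = Counter(
            tokens[i + rel]
            for tokens, i in contexts
            if 0 <= i + rel < len(tokens)
        )
        word, n = counts.most_common(1)[0] if counts else (None, 0)
        # Keep the word only if it fills this position very often
        template.append(word if n / len(contexts) >= threshold else "*")

    # Drop wildcards at either end; interior wildcards mark open slots
    while template and template[0] == "*":
        template.pop(0)
    while template and template[-1] == "*":
        template.pop()
    return template
```

A fuller version would replace an interior `"*"` with the dominant part of speech at that position when one exists, as the last bullet describes.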

Xtract stage three
• Applied to pairs of words
• Requires (partial) parsing
• Examines the syntactic relationship between words and keeps those pairs with consistent relationships (e.g., verb-object)

Evaluation
• Ask lexicographer to evaluate output
• 40% precision after stages one and two
• 80% precision after stage three
• 94% conditional recall

Terminology
• Béatrice Daille, “Study and Implementation of Combined Techniques for Automatic Extraction of Terminology”, ACL Balancing Act workshop, 1994
• Terms refer to concepts
• Terms key for populating a domain ontology
• Terms are typically nominal compounds of certain structure, e.g., NN, N of N

Defining terms
• Unique reference
• Unique translation
• Term extension by
  – modification (e.g., addition of an adjective)
  – substitution
  – extension of structure
  – coordination

Algorithm
• Apply syntactic constraints to match pairs of words in a candidate term
• Filter by application of an association measure
• Measures examined: pointwise mutual information, Φ² (chi-square), log-likelihood ratio
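
For reference, the three measures named above can be computed from a 2×2 contingency table for a word pair. The sketch below uses the standard textbook definitions; the function names and the cell naming are assumptions made here, and the log-likelihood ratio is written in Dunning's G² form, whose scaling may differ from the formulation used in the paper. Here `a` counts pairs containing both words, `b` pairs with the first word only, `c` pairs with the second word only, and `d` pairs with neither.

```python
import math


def pmi(a, b, c, d):
    """Pointwise mutual information (base 2) of the word pair."""
    n = a + b + c + d
    if a == 0:
        return float("-inf")
    return math.log2(a * n / ((a + b) * (a + c)))


def phi_squared(a, b, c, d):
    """Phi-squared coefficient: chi-square divided by the sample size."""
    denom = (a + b) * (a + c) * (b + d) * (c + d)
    return (a * d - b * c) ** 2 / denom if denom else 0.0


def log_likelihood_ratio(a, b, c, d):
    """G^2 = 2 * sum(O * ln(O / E)) over the four cells of the table."""
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    observed = ((a, b), (c, d))
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                e = rows[i] * cols[j] / n
                g2 += o * math.log(o / e)
    return 2.0 * g2
```

Candidates that pass the syntactic filter would then be ranked by one of these scores. All three are zero for a table in which the two words are independent.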

Observations
• Compare with reference list
• Frequency a strong predictor
• Log-likelihood ratio works best
• Additional criteria:
  – diversity of the distribution of each word
  – distance between the two words (determines flexibility but not term status)

Justeson and Katz
• Justeson and Katz, “Technical Terminology: Some Linguistic Properties and an Algorithm for Identification in Text”, Natural Language Engineering, 1995.

Analysis
• Examined association measures
• Well-known problems:
  – eliminating general-language constructs (e.g., collocations)
  – what to do with single word terms?

Observations
• Frequency works well
• But a stronger predictor is P(k>1) compared to P(k≥1) in the same document
• Use syntactic patterns to propose terms, then check if they reappear in the same document
• Require this across multiple documents
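
One possible reading of the repetition criterion is sketched below. The function name `repetition_scores`, the input format, and the `min_docs` threshold are assumptions made here; the slide does not say how the two probabilities are compared, so the sketch simply takes the ratio of documents where a candidate occurs more than once to documents where it occurs at all. Candidates are assumed to have been proposed already by syntactic patterns.

```python
from collections import Counter


def repetition_scores(doc_counts, min_docs=2):
    """Estimate P(k > 1) / P(k >= 1) per candidate across documents.

    doc_counts: one dict per document, mapping candidate -> occurrences k.
    Only candidates repeated within at least `min_docs` documents are kept.
    """
    seen = Counter()      # documents where the candidate occurs (k >= 1)
    repeated = Counter()  # documents where it occurs again (k > 1)
    for counts in doc_counts:
        for term, k in counts.items():
            if k >= 1:
                seen[term] += 1
            if k > 1:
                repeated[term] += 1
    return {
        term: repeated[term] / seen[term]
        for term in seen
        if repeated[term] >= min_docs
    }
```

Under this reading, a phrase such as "last year" that appears once in many documents is dropped, while a phrase that tends to be repeated whenever it appears scores close to 1.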

Term Expansion
• Jacquemin, Klavans, and Tzoukermann, “Expansion of Multi-Word Terms for Indexing and Retrieval Using Morphology and Syntax”, ACL 1997.
• Need to expand a given list of terms, especially for scientific domains

Term variation
• Syntactic (same words, different structure)
• Morphosyntactic (derivational forms of words)
• Semantic (synonyms are used)
• In IR, normalization through stemming and removal of stop words

Approach
• Process corpus matching new candidate terms to old ones via unification
• Matching based on
  – inflectional morphology (transducer)
  – derivational morphology (rule-based)
  – syntactic transformations
  – additions of words

Results
• Manual inspection of several thousand proposed terms
• Precision of 89%
• Effectiveness in indexing increases by a factor of three when using the variants (P/R from 99.7/72 to 97/93)