LREC 2010, La Valletta, Malta, May 2010 Handling of missing values in lexical acquisition

Núria Bel
Universitat Pompeu Fabra
GRUP DE TECNOLOGIES DELS RECURSOS LINGÜÍSTICS (TRL)

By Automatic Lexical Information Acquisition we try to find how to build repositories of language-dependent lexical information automatically. Many technologies behind applications (MT, IE, Automatic Summarization, Sentiment Analysis, Opinion Mining, Question Answering, etc.) need this information to work.

Entries borrowed from MT system Incyta (Metal family):
("paralelo" AST ("fiesta" NST ALO "paralel" ALO "fiest" ATR POST CL (PF-AS SF-A) CL (PF-AS PM-OS SF-A SM-O) GD (F) FC (NPP) KN MS LY AMENTE PLC (NF) MC ("a") PLC (NG) TYN (ABS) PRED (ESTAR SER) AUTHOR "juan" TA (OBJ-P REL) DATE "28-Aug-99" AUTHOR "juan" SITE "FB 52") DATE "31-Aug-99" SITE "FB 52")

Cue-Based Lexical Acquisition
• Differences in the distribution of certain contexts separate words of different classes (Harris, 1951).
• For example: some / *many mud
• Words (types) can be represented as a collection of contexts. Whether or not a word occurs in each of these contexts is taken as a hint, or cue, that the word belongs to a particular class.

A word's occurrences are represented as a vector and used to train a classifier.

@data
15, 2, 8, 4, 0, 8, 1, 0, 0, 0

Each value is the number of times the word has been observed in one of the defined contexts. Non-occurrence in a particular context is as informative as occurrence. We use supervised classifiers (Support Vector Machines, Decision Trees) to predict the class (Abstract, Mass, etc.) of new words.
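
As a rough illustration of how such a count vector could be produced, the sketch below counts regex matches for a few contexts. The cue patterns, the function name `cue_vector` and the example sentences are assumptions made for this example; they are not the cues used in the experiments.

```python
import re

# Hypothetical cue patterns (not the ones used in the talk). Each template
# describes one context; {w} is replaced by the word under study.
CUE_PATTERNS = [
    r"\bsome {w}\b",     # cue 1: "some <word>"
    r"\bmany {w}s?\b",   # cue 2: "many <word>(s)"
    r"\bmuch {w}\b",     # cue 3: "much <word>"
]

def cue_vector(word, sentences, patterns=CUE_PATTERNS):
    """Return how many times `word` was observed in each defined context."""
    vector = []
    for template in patterns:
        regex = re.compile(template.format(w=re.escape(word)), re.IGNORECASE)
        vector.append(sum(len(regex.findall(s)) for s in sentences))
    return vector

sentences = [
    "There was some mud on the road.",
    "Too much mud, and some mud again.",
]
print(cue_vector("mud", sentences))  # [2, 0, 1]
```

The zero in the second position is exactly the kind of value discussed in the rest of the talk: it may be a true negative ("mud" does not combine with "many") or simply a context that did not show up in this tiny sample.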

Cues, classification and state-of-the-art results
• Merlo and Stevenson (2001) selected very specific cues for classifying verbs into a number of Levin (1993) based verbal classes: animacy of the subject, passives, ...
• Baldwin (2005) used general features, such as the POS tags of neighboring words, for type classification.
• Joanis et al. (2007) used the frequency of filled syntactic positions or slots, tense and voice of occurring verbs, etc., to describe the whole system of English verbal classes.
• The results are difficult to compare, but accuracy is about 70%.

The problem: missing values

The sparse data problem
• Joanis and Stevenson (2003), Joanis et al. (2007) and Korhonen et al. (2008) mention that they have to face the problem of sparse data: many of the types/words are low in frequency and provide very little information.
• Most words appear very rarely (i.e. they follow a Zipf distribution) and therefore show few cues.
• Yallop et al. (2005) calculated that in the 100-million-word British National Corpus, out of a total of 124,120 distinct adjectives, 70,246 occur only once.
• The cues we can use as information are mutually exclusive within a single occurrence: an adjective can be both prenominal and postnominal, but if it occurs only once it will show only one cue, the others being zero values.
• Even for types that occur more than once, the optional nature and variety of the contexts of occurrence give rise to missing values.
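
To make the scale of the problem concrete, the short sketch below computes the share of types that occur only once. The function name `hapax_ratio` and the toy token list are assumptions for illustration; only the two BNC figures come from the slide.

```python
from collections import Counter

def hapax_ratio(tokens):
    """Fraction of distinct types that occur exactly once in `tokens`."""
    counts = Counter(tokens)
    if not counts:
        return 0.0
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return hapaxes / len(counts)

# Toy example: "red" occurs twice, "big" and "old" once each.
print(round(hapax_ratio(["red", "big", "red", "old"]), 3))  # 0.667

# Figures quoted from Yallop et al. (2005): adjectives in the BNC.
bnc_ratio = 70246 / 124120
print(round(bnc_ratio, 3))  # 0.566
```

In other words, more than half of the adjective types in the corpus can show at most one cue each.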

Zero values and learning
• Zero values create not only a problem of having too little information to decide, but also further uncertainty when learning from the data.
• A zero value could indeed be a negative value, i.e. the cue is that the context has not been observed, but it could also be that the context was simply not observed in the examined corpus, for various reasons.
• When there are many zero values, the cue loses its predictive power because of this uncertainty.
• Katz (1987) and Baayen and Sproat (1996), among others, acknowledged the importance of preprocessing low-frequency events, and Joanis et al. (2007) also decided to smooth the data, even when working with more than 1000 occurrences per verb in the BNC.

Our smoothing experiment: Harmonization based on linguistic information

Intuitively: how likely is it that a 0 is just an unobserved feature and not a true 0, given the values of the other observations?

To classify Abstract/Concrete nouns in English:
• Cue 1: suffixes "-ness", "-ism", ... (for Abstract; Light, 1996)
• Cue 2: determiners "such", "little", "much", ... (for Abstract)
• Cue 3: adjectives like "big", "small", ... (for Concrete)

P(cue_1=1 | [0, 1, 0]) = P(abstract=yes | [0, 1, 0]) * P(cue_1=1 | abstract=yes) + P(abstract=no | [0, 1, 0]) * P(cue_1=1 | abstract=no)
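
The two-class formula on this slide can be read as a special case of the law of total probability over classes. The general form below is a sketch under one stated assumption: that the cue is conditionally independent of the rest of the vector once the class is known. The symbols \mathbf{v} (the observed vector), K (the set of classes) and c_i (the i-th cue) are notation introduced here, not taken from the slides.

```latex
P(c_i = 1 \mid \mathbf{v})
  = \sum_{k \in K} P(k \mid \mathbf{v}) \, P(c_i = 1 \mid k)
```

With K = {abstract=yes, abstract=no}, \mathbf{v} = [0, 1, 0] and i = 1, this reduces to the expression given on the slide.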

• We use the information from the observed features to assess the likelihood of a particular unobserved cue.
• Harmonization means substituting each 0 value with the likelihood of it being 1, given the other cues observed.
• BUT ... in order to get P(cue_1=1 | [0, 1, 0]) we need P(cue_n|class) for all cues in the vector.
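
The substitution step described above could look roughly like the following. This is a minimal sketch, not the implementation used in the experiments. The slides do not say how P(class | vector) is obtained, so the naive Bayes posterior over observed cues and the uniform prior are assumptions, as are the probability values, the mapping of observed cues to 1.0, and the names `class_posterior`, `harmonize`, `P_CUE_GIVEN_CLASS` and `PRIOR`.

```python
# Hypothetical P(cue_n = 1 | class) for the three Abstract/Concrete cues.
# These numbers are invented for illustration only.
P_CUE_GIVEN_CLASS = {
    "abstract": [0.5, 0.6, 0.1],
    "concrete": [0.0, 0.2, 0.7],
}
PRIOR = {"abstract": 0.5, "concrete": 0.5}  # assumed uniform prior

def class_posterior(vector, p_cue_given_class, prior):
    """Naive Bayes posterior over classes, using only the observed cues.

    Assumption: zeros are treated as unknown, not as negative evidence.
    """
    scores = {}
    for cls, p_cues in p_cue_given_class.items():
        score = prior[cls]
        for value, p in zip(vector, p_cues):
            if value > 0:
                score *= p
        scores[cls] = score
    total = sum(scores.values())
    if total == 0:
        return dict(prior)  # no class explains the observed cues
    return {cls: s / total for cls, s in scores.items()}

def harmonize(vector, p_cue_given_class, prior):
    """Replace each 0 with P(cue = 1 | observed vector)."""
    posterior = class_posterior(vector, p_cue_given_class, prior)
    harmonized = []
    for i, value in enumerate(vector):
        if value > 0:
            harmonized.append(1.0)  # assumption: observed cues become 1
        else:
            harmonized.append(
                sum(posterior[c] * p_cue_given_class[c][i] for c in posterior)
            )
    return harmonized

result = harmonize([0, 1, 0], P_CUE_GIVEN_CLASS, PRIOR)
print([round(x, 3) for x in result])  # [0.375, 1.0, 0.25]
```

With these made-up numbers, observing only the determiner cue makes the noun three times as likely to be abstract as concrete, so the unobserved suffix cue is raised to 0.375 rather than left at 0.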

The challenge: how to get P(cue_n|class) with so many 0's in the data?

By estimating P(cue_n|class) with linguistic information.

Columns: Suffix=no, Suffix=yes, SC_Adj=no, SC_Adj=yes
Abstract: 0.5, 1.0, 0.0
Concrete: 1.0, 0.5

"The probability of being Concrete and having suffix "ness" is 0"

Harmonization effects in the Spanish Mass experiment

Columns: Harmonized, Frequency, types
0, 1, 1, 0, 0, 1, 1, 0 0, 3, 0, 1, 1, 0, 0, 1, 1, 0 agua ('water')
1, 1, 0.5, 1, 1, 0, 0, 0 1, 2, 0, 0, 0, 2, 1, 1, 2, 0, 0, 0 acero ('steel')
0.5, 1, 0.5, 0, 0, 0, 1, 0, 0, 0, 0 desabastecimiento ('shortage')
0.02, 0.47, 0.47 0, 0, 0, 0, 0 aceptabilidad ('acceptability')

Results of the experiments

Experiment        Classifier  Mean  Trimmed mean  Frequency  Harmonized  Baseline
Spanish Mass      DT          74.2  77.5          79.9       82.8        74.8
Spanish Mass      SVM         63.8  67.4          79.1       80.7        74.8
English Abstract  DT          57.8  55.6          61.4       76.1        61.5
English Abstract  SVM         61.0  61.0          64.1       70.1        61.5

Error Analysis & Future work
• The frequency information used to filter noise has been neutralized.
• Future work is about how to handle missing values and noise together.

Thanks for your attention!