Machine learning for query formulation in question answering
Based on an article by Christof Monz

Question Answering (QA) (Wikipedia): a computer science discipline within the fields of information retrieval (IR) and natural language processing (NLP) that is concerned with building systems that automatically answer questions posed by humans in natural language.

Formulation of queries
Predicting the importance of terms from the original question in order to form an effective query*.
*An effective query is a query with high accuracy in identifying documents that contain the answer.

Introduction
NLP is a good thing...
The user's information need is expressed as a well-formed sentence (as opposed to a set of keywords), which allows us to analyze the sentence with NLP tools.

Bad idea...
The question: What is the abbreviation of the London Stock Exchange?
The query: abbreviation London stock exchange
BUT the document containing the answer most likely doesn't include the word "abbreviation":
'... London Stock Exchange (LSE) ...'

Useful background
taken from an AI course by J. Rosenschein

Some more decisions...
A decision tree learner needs to decide:
• Which attribute to split on, and when? (i.e. how to pick)
• What is the appropriate root? (i.e. the top-most attribute)
• When to stop? (continue splitting until...)

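The "how to pick" question can be sketched with information gain, a common splitting criterion. This is only an illustration: the slides do not say which criterion is used, and the attribute names and toy rows below are made up.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting the rows on one attribute."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g)
                    for g in groups.values())
    return entropy(labels) - remainder


def best_attribute(rows, labels, attributes):
    """Pick the attribute with the highest information gain (e.g. for the root)."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))


# Toy, made-up training data: two attributes, two classes.
rows = [
    {"outlook": "sunny", "windy": False},
    {"outlook": "sunny", "windy": True},
    {"outlook": "rainy", "windy": False},
    {"outlook": "rainy", "windy": True},
]
labels = ["no", "no", "yes", "yes"]

root = best_attribute(rows, labels, ["outlook", "windy"])
print(root)
```

In this toy data "outlook" separates the classes perfectly while "windy" tells us nothing, so "outlook" would be chosen as the root.
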
Regression Trees
Similar to a decision tree, except that it can handle continuous feature values, and instead of a decision label at the leaves there are function values, called classes.
This is what we are going to use.

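To make the difference from a decision tree concrete, here is a minimal sketch of a regression tree with a single split on one continuous feature. It assumes squared error as the split criterion and the mean of the training values as the leaf value; both are illustrative choices, not necessarily those of the algorithm used in the article, and the data is made up.

```python
def mean(values):
    return sum(values) / len(values)


def sse(values):
    """Sum of squared errors of the values around their mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(xs, ys):
    """Return (threshold, left_value, right_value) minimizing squared error."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between two equal feature values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        error = sse(left) + sse(right)
        if best is None or error < best[0]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (error, threshold, mean(left), mean(right))
    if best is None:
        raise ValueError("cannot split: all feature values are equal")
    return best[1:]


def predict(stump, x):
    """Numeric prediction: the value stored at the leaf that x falls into."""
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


# Toy, made-up data: a continuous feature and a numeric target.
xs = [0.1, 0.2, 0.3, 0.8, 0.9]
ys = [0.0, 0.0, 0.0, 1.0, 1.0]
stump = best_split(xs, ys)
print(stump, predict(stump, 0.25), predict(stump, 0.85))
```

A full regression tree would apply the same split search recursively to each side until a stopping rule is met.
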
Learning term weights
The input for our learning algorithm is a set of feature vectors (ft. 1, ft. 2, ..., ft. N), one vector per term.
For each term we compute its weight (the class), and the machine learning algorithm should learn to predict how useful the query term is for query formulation.

Back to query formulation
We're going to present a set of features that will help us predict the importance of a term in a query.
But first:
• Our training set: TREC-9/10/11
• Our oracle: NIST's judgment files

Text Retrieval Conference (TREC): an ongoing series of workshops focusing on a list of different information retrieval research areas.
TREC-9/10/11: data sets consisting of 500 different questions. Also, the AQUAINT document collection.
NIST's judgment file: a file which indicates, for each submitted answer, whether it is correct or not. (Participants were required to return <answer, doc-id> pairs.)

The features
1. Part of Speech (POS)
Finding out the syntactic category of a term in a sentence (values: N - noun, V - verb, ...).
Using TreeTagger (a decision-tree-based tagger), we'll create a treebank-style tree (a tree in which every leaf represents a word from the sentence, and the branch leading to that leaf defines its syntactic category).

Treebank example: "John loves Mary"
Node labels in the tree: simple clause, verb phrase, noun phrase, verb in present tense, proper name (singular).

2. Question Focus
A phrase describing the ontological superclass of the answer.
Examples:
• In what country did the game of croquet originate? (focus: country)
• What college did Magic Johnson attend? (focus: college)

Focus can harm
In what country did the game of croquet originate?
The answer is France, which is a country. However, this is considered common knowledge and is therefore seldom stated explicitly in a document.
BUT not necessarily...
Q: What is the deepest lake in America?
A: Crater Lake is America's deepest lake.

A term can receive the following values:
• 0 - if the word isn't part of the question focus.
• 1 - if the word is part of the question focus and is the semantic head of a noun phrase.
• 0.5 - if the word is part of the question focus but is not the semantic head of a noun phrase.
Example: What mythical Scottish town appears for 1 day every 100 years?
mythical, Scottish = 0.5

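The three values can be sketched as a small function. It assumes the focus phrase and its semantic head have already been identified elsewhere (for example by a parser); the function name and the tokenized question are illustrative, and treating "town" as the head of "mythical Scottish town" is our reading of the example.

```python
def focus_feature(question_terms, focus_terms, focus_head):
    """Map each question term to its question-focus value: 0, 0.5 or 1."""
    values = {}
    for term in question_terms:
        if term not in focus_terms:
            values[term] = 0.0   # not part of the question focus
        elif term == focus_head:
            values[term] = 1.0   # semantic head of the focus noun phrase
        else:
            values[term] = 0.5   # in the focus, but not its head
    return values


# Content words of the example question (tokenization is assumed).
question = ["mythical", "Scottish", "town", "appears", "day", "years"]
values = focus_feature(question, {"mythical", "Scottish", "town"}, "town")
print(values)
```
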
How to focus
To find the question focus we use MINIPAR (a robust full dependency parser).
In the directed dependency graph produced by MINIPAR, we look at the outgoing arcs of the nodes representing wh-words, because the wh-words modify the question focus.

3. location - Boolean (is the word part of a location name?)
4. abbreviation - Boolean (is the word an abbreviation?)
5. superlative - Boolean (does the question contain a superlative adjective?), as in the "deepest lake" example
6. upper case - Boolean (does the word start with an uppercase letter?), as in a song's name ("Happy Together")
7. question class - one of a fixed list of question classes (question type: date, location, agent, etc.)
8. classif. word - Boolean (was the word used to classify the question?), e.g. What province is Calgary located in? - location
9. multpl. occurr. - Boolean (does the word occur more than once in the question?)
10. person name - one of a fixed set of values indicating what part of a person's name the word is, if applicable
11. quoted - Boolean (does the word occur between quotation marks?)
12. honorific - Boolean (is the word an honorific term, e.g. Dr.?)
13. modified noun - Boolean (is the word a noun that is preceded (modified) by another noun?), e.g. Who holds the record as the highest paid child performer?
14. no. edges - a natural number indicating the number of edges pointing to the word in the dependency parse graph of the question. More edges mean more relationships, so the word is more likely to play a main role.
15. term ratio - 1/m, where m is the number of unique terms in the question
16. hypernym - Boolean (is the word a WordNet* hypernym of another word in the question?), e.g. In what country did the game of croquet originate?
17. no. leaves - the number n (n ≥ 0) of hyponyms of the word in the WordNet hierarchy that do not have any further hyponyms themselves, e.g. person vs. explorer
18. relative idf - a real value indicating the relative frequency of the word in the document collection compared to the frequencies of the other words in the question
*WordNet: a lexical database of English words. Contains sets of synonyms (synsets).

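A few of the simpler features in this list can be sketched directly. The Boolean checks and the term ratio follow the definitions in the list; the relative idf formula is an assumption (each word's idf divided by the summed idf of all question words), since the slides do not give the exact normalization, and the document frequencies are made up.

```python
import math


def surface_features(term, question_terms):
    """Two Boolean features that need only the question text."""
    return {
        "upper_case": term[:1].isupper(),
        "multpl_occurr": question_terms.count(term) > 1,
    }


def term_ratio(question_terms):
    """1/m, where m is the number of unique terms in the question."""
    return 1.0 / len(set(question_terms))


def idf(term, doc_freq, num_docs):
    """Plain inverse document frequency; assumes doc_freq[term] >= 1."""
    return math.log(num_docs / doc_freq[term])


def relative_idf(term, question_terms, doc_freq, num_docs):
    """Assumed normalization: the term's share of the question's total idf."""
    total = sum(idf(t, doc_freq, num_docs) for t in set(question_terms))
    return idf(term, doc_freq, num_docs) / total


# Content words of "What is the deepest lake in America?" (tokenization assumed).
terms = ["deepest", "lake", "America"]
# Made-up document frequencies in a made-up collection of 100,000 documents.
doc_freq = {"deepest": 100, "lake": 1000, "America": 10000}
num_docs = 100000

for t in terms:
    print(t, surface_features(t, terms), relative_idf(t, terms, doc_freq, num_docs))
print("term ratio:", term_ratio(terms))
```

With these made-up counts the rarest word, "deepest", receives the largest relative idf.
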
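Average precision of a query
Given a question, we look at all its query variants and pick the one with the highest average precision:

$$\mathrm{avg\_prec}(q) = \frac{1}{|REL|} \sum_{d \in REL} \frac{|\{d' \in REL : \mathrm{rank}(d') \le \mathrm{rank}(d)\}|}{\mathrm{rank}(d)}$$

where q is a query variant, REL is the set of relevant documents, rank is the given ranking function, and each summand is the ratio of documents in REL ranked at or above d to the rank of d.
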
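A minimal sketch of this computation, assuming the usual definition of average precision (for each relevant document, the precision at its rank, averaged over all relevant documents; relevant documents that are never retrieved contribute 0). The document ids, rankings and query variants are made up.

```python
def average_precision(ranking, relevant):
    """ranking: doc ids ordered best first; relevant: set of relevant doc ids."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1              # relevant documents ranked at or above this one
            total += hits / rank   # precision at this rank
    return total / len(relevant)


relevant = {"d1", "d2", "d9"}

# Made-up rankings returned for two query variants of the same question.
rankings = {
    "abbreviation London stock exchange": ["d3", "d1", "d7", "d2"],
    "London stock exchange": ["d1", "d2", "d3", "d7"],
}

scores = {q: average_precision(r, relevant) for q, r in rankings.items()}
best_variant = max(scores, key=scores.get)
print(scores, best_variant)
```
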
Initial term weights
The presence weight and the absence weight of a term t are combined into a single value, also called the gain, which can be computed as:
[formula not preserved]
*Here q is a question and tsv(q) is the set of all its query variants.

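One plausible reading of the gain, sketched under stated assumptions: the presence weight of t is the mean average precision of the query variants that contain t, the absence weight is the mean over the variants that do not, and the gain is their difference. The sketch also assumes that tsv(q) is the set of all non-empty subsets of the question terms. The average precision values are made up.

```python
from itertools import combinations


def query_variants(terms):
    """All non-empty subsets of the question terms (assumed meaning of tsv(q))."""
    return [frozenset(c)
            for n in range(1, len(terms) + 1)
            for c in combinations(terms, n)]


def gain(term, variants, avg_prec):
    """Assumed gain: presence weight minus absence weight of the term."""
    with_t = [avg_prec[v] for v in variants if term in v]
    without_t = [avg_prec[v] for v in variants if term not in v]
    presence = sum(with_t) / len(with_t) if with_t else 0.0
    absence = sum(without_t) / len(without_t) if without_t else 0.0
    return presence - absence


terms = ["abbreviation", "London", "exchange"]
variants = query_variants(terms)

# Made-up average precision: variants containing "abbreviation" do worse.
avg_prec = {v: (0.2 if "abbreviation" in v else 0.6) for v in variants}

for t in terms:
    print(t, round(gain(t, variants, avg_prec), 3))
```

Under this reading a negative gain means that queries tend to do better without the term, which matches the "abbreviation" example from the introduction.
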
Putting it all together
The gain shown previously is our initial term gain; the regression tree algorithm then adjusts it according to the features chosen in the tree (based on an inner function of the algorithm).
Our learning algorithm should learn to predict how useful a query term is for query formulation.

A learned model tree
The feature values
Results*
Correlation coefficient: indicates the degree to which the predicted values and the original values correlate. A value of 1 (−1) indicates perfect (inverse) correlation, and a value of 0 indicates no correlation at all.
• Here the predicted and original values are weakly correlated.
• The error is rather high, but better than choosing the mean training value.
*Comparing the predicted term weight to the original.

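Both measures can be sketched in a few lines, assuming the correlation coefficient is Pearson's and that "better than choosing the mean training value" means an absolute error below that of a baseline that always predicts the training mean. The weights below are made up and are not the article's results.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def relative_absolute_error(predicted, actual, training_mean):
    """Model error divided by the error of always predicting the training mean."""
    model_error = sum(abs(p - a) for p, a in zip(predicted, actual))
    baseline_error = sum(abs(training_mean - a) for a in actual)
    return model_error / baseline_error


# Made-up original and predicted term weights.
actual = [0.4, -0.2, 0.1, 0.3]
predicted = [0.3, 0.0, 0.2, 0.1]

r = pearson(predicted, actual)
rae = relative_absolute_error(predicted, actual, training_mean=0.15)
print(r, rae)
```

A relative error below 1 means the model beats the mean baseline, even if the error itself is still high.
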
Conclusion
[...] can benefit from using our approach.
In some cases, whether a term is helpful for retrieving answer documents simply depends on idiosyncrasies of the documents that contain an answer, but our training sets were fairly large and varied in order to generalize properly.
In addition, our proposed term weight learning approach yields significantly better results than the passage-based retrieval approaches commonly used for document retrieval in the context of question answering.