Question Answering and Logistics


Class Logistics
§ Comments on proposals will be returned next week and may be available as early as Monday
§ Look at website for slides
§ Grades on presentations and discussant questions


Next week: 2/21
§ Invited speaker: John Prager, IBM
§ Location: 7th Floor Interschool Lab
§ Class structure:
  § First half: talk
  § Second half: discussion
§ Discussants: David Smith, Narayan
§ Others: raise questions
§ This is your chance to find out the details of how a system works


Using Knowledge-Based Constraints to Improve Question-Answering Accuracy
§ Most Question-Answering systems use a combination of statistical and symbolic techniques: for example, almost all use a search component, which fetches documents and/or passages using statistical matching formulae, and answer-selection techniques, which are often more linguistically informed. The QA system at IBM Research, which has performed well in TREC-QA over the years, is no different in those respects, but we have at the same time been exploring various knowledge-based filtering techniques to constrain candidate answers. I will describe three such techniques. The first, which has been part of our core system since 1999, is what we call Predictive Annotation, a form of semantic indexing in which the answer type is a required term in the search engine query, greatly reducing the number of passages that need to be considered. QA-by-Dossier asks additional questions beyond the one from the user, and enforces real-world constraints between the different questions and answers, on the assumption that only correct answers will provide a consistent model. Finally, Question Inversion is a specific form of QA-by-Dossier in which initial candidate answers are inserted into a reformulated question with a term removed, with the expectation that only the correct answer will allow the removed term to be recovered. I will present experimental results from using these techniques, and discuss their pros and cons.
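
To make the Question Inversion step concrete, here is a minimal sketch of the filtering loop it implies. This is not the IBM implementation: ask() is a hypothetical stand-in for a full QA pipeline, and the inverted question is supplied as a hand-written template that a real system would generate automatically.

# Minimal sketch of Question Inversion as described in the abstract above.
# ask() is a hypothetical hook; only the filtering logic is illustrated.

def ask(question):
    """Hypothetical call into a QA system; returns a ranked list of answer strings."""
    raise NotImplementedError("plug in a real QA pipeline here")

def invert(inverted_template, candidate):
    """Insert a candidate answer into the inverted question.

    Example: for "Who discovered radium?" with the term "radium" removed,
    the inverted template might be "What did {candidate} discover?".
    """
    return inverted_template.format(candidate=candidate)

def filter_by_inversion(candidates, inverted_template, removed_term):
    """Keep only candidates whose inverted question recovers the removed term.

    The assumption (from the abstract) is that only a correct answer,
    once substituted into the question, lets the removed term be found again.
    """
    kept = []
    for cand in candidates:
        answers = ask(invert(inverted_template, cand))
        if any(removed_term.lower() in a.lower() for a in answers):
            kept.append(cand)
    return kept or candidates  # if nothing survives, fall back to the original list

# Usage (with the hypothetical names above):
# filter_by_inversion(["Marie Curie", "Niels Bohr"],
#                     "What did {candidate} discover?", "radium")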


Questions: Kristen – Hermjakob
The reformulation technique described in [Hermjakob et al.] requires a "person in the loop" to generalize phrase synonyms from automatically-extracted patterns. These are likely to be high-quality at the cost of being low-coverage (420 assertions).
§ Some types of data are naturally suited to knowledge bases, for example, dictionaries, synonyms, lists of countries, etc. WordNet is an example of a highly successful knowledge base that is used a lot in NLP. Are phrasal reformulations suitable for storing in a knowledge base? I.e., if we could expand their system to contain a million assertions, would this hand-crafted database be sufficient/useful?
§ These reformulations require the authors to identify "anchor patterns" that can be reformulated; only questions that match these patterns can be expanded (see the sketch below). What are good resources for finding anchor patterns, i.e., for finding questions that users are likely to ask, besides prior TREC evaluations?
§ Can you think of ways that we could automatically extract patterns like this? Or ways where the human in the loop could do a lot less work?
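
As a rough picture of what anchor-pattern reformulation might look like, here is a toy sketch. The two patterns are invented for illustration and are not taken from Hermjakob et al.; their system holds hundreds of hand-checked assertions rather than the handful shown here.

# Toy illustration of anchor-pattern question reformulation.
import re

# Each anchor pattern (a question shape) maps to answer-bearing rephrasings;
# the named group carries the matched phrase into the reformulation.
REFORMULATIONS = [
    (re.compile(r"^who invented (?P<x>.+)\?$", re.I),
     ["{x} was invented by", "the inventor of {x}"]),
    (re.compile(r"^when was (?P<x>.+) born\?$", re.I),
     ["{x} was born in", "{x} (born"]),
]

def reformulate(question):
    """Return search phrases for a question that matches an anchor pattern.

    Questions that match no pattern return an empty list, i.e. they cannot
    be expanded this way, which is exactly the coverage concern raised above.
    """
    phrases = []
    for pattern, templates in REFORMULATIONS:
        match = pattern.match(question)
        if match:
            phrases.extend(t.format(**match.groupdict()) for t in templates)
    return phrases

print(reformulate("Who invented the transistor?"))
# -> ['the transistor was invented by', 'the inventor of the transistor']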


Kristen – annotation
§ All of the papers rely heavily on data annotation, e.g., marking names, numbers, syntactic parsing, dependency parsing, etc. However, automatic annotation is never perfect. (For example, the best named entity tagger gets 90-95% F-measure on English newswire, which is a very good score, but this means that 1 out of every 10-20 named entities is incorrectly tagged!)
§ Can automatic annotation ever be "perfect"? What implications does this have for higher-order NLP processing, such as question answering?
§ What measures do these systems take to address these problems?


Kristen and Madhav – Evaluation
§ TREC has spurred a lot of research into question answering over newswire, and created reusable data for system comparison.
§ What are some of the downsides to having a standardized "bake-off" with a shared corpus/language/question type? If you worked for a web search company, would you implement the systems we've read about?
§ Is the TREC evaluation methodology suitable? Should the answer nuggets be constructed independently of system outputs? What would you do to evaluate a QA system?


Madhav – Statistical vs. Symbolic
§ 1. Moldovan et al. describe a heuristic-based answer extraction system, whereas Ittycheriah et al. use a statistically driven method. What are the drawbacks and advantages of each? Would these techniques, given the various heuristics (or features) used, adapt to different domains? Moldovan et al. claim that their system is open-domain.


Madhav – Complexity
§ 3. Current QA systems seem to have to perform a great deal of processing in order to fulfill the task. When systems are as complex as this, and are made up of many sub-components, is it at all possible (say, as a researcher) to pinpoint the "weak link" within a system in order to improve its performance? Can one evaluate the sub-components independently? Is that necessary? For instance, IBM used a dependency parser: do you think that similar higher-level NLP techniques would be more helpful in solving the problem at hand? Or will lower-level bag-of-words (IR) techniques be more suitable?