Good Question! Statistical Ranking for Question Generation • Michael Heilman and Noah A. Smith • North American Chapter of the Association for Computational Linguistics - Human Language Technologies (NAACL HLT 2010)

Agenda • Introduction • Related Work • Three-stage Framework AQG • Evaluation • Conclusion • Comments

Introduction(1/3) • In this paper, we focus on question generation (QG) for the creation of educational materials for reading practice and assessment. • Our goal is to generate fact-based questions about the content of a given article. • The top-ranked questions could be filtered and revised by educators, or given directly to students for practice. • Here we restrict our investigation to questions about factual information in texts.

Introduction(2/3) • Consider the following sentence from the Wikipedia article on the history of Los Angeles: During the Gold Rush years in northern California, Los Angeles became known as the “Queen of the Cow Counties” for its role in supplying beef and other foodstuffs to hungry miners in the north. • Los Angeles become known as the “Queen of the Cow Counties” for its role in supplying beef and other foodstuffs to hungry miners in the north. • What did Los Angeles become known as the “Queen of the Cow Counties” for?

Introduction(3/3) • Question transformation involves complex long distance dependencies. • The characteristics of such phenomena are difficult to learn from corpora, but they have been studied extensively in linguistics. • However, since many phenomena pertaining to question generation are not so easily encoded with rules, we include statistical ranking as an integral component. • Thus, we employ an overgenerate-and-rank approach.

Related Work • None of the existing QG approaches has involved statistical models for choosing among output candidates. • Mitkov et al. (2006) demonstrated that automatic generation and manual correction of questions can be more time-efficient than manual authoring alone. • Existing QG systems model their transformations from source text to questions with many complex rules for specific question types (e.g., a rule for creating a question Who did the Subject Verb? from a sentence with SVO word order and an object referring to a person), rather than with sets of general rules.

Research Objectives • We apply statistical ranking to the task of generating natural language questions. • We model QG as a two-step process of first simplifying declarative input sentences and then transforming them into questions. • We incorporate linguistic knowledge to explicitly model well-studied phenomena related to long distance dependencies in WH questions. • We develop a QG evaluation methodology, including the use of broad-domain corpora.

Three-stage Framework AQG • We define a framework for generating a ranked set of fact-based questions about the text of a given article. • From this set, the top-ranked questions might be given to an educator for filtering and revision, or perhaps directly to a student for practice.

Stage 1 Transforming Source Sentence • Each of the sentences from the source text is expanded into a set of derived declarative sentences by altering lexical items, syntactic structure, and semantics. • In our implementation, a set of transformations derives simpler forms of the source sentence by removing phrase types such as leading conjunctions, sentence-level modifying phrases, and appositives.

Stage 1 Transforming Source Sentence • Complex source sentence: Prime Minister Vladimir V. Putin, the country's paramount leader, cut short a trip to Siberia, returning to Moscow to oversee the federal response. • Extracted factual sentences: • Prime Minister Vladimir V. Putin cut short a trip to Siberia. • Prime Minister Vladimir V. Putin was the country's paramount leader. • Prime Minister Vladimir V. Putin returned to Moscow to oversee the federal response.

Stage 2 Question Transducer • The declarative sentences derived in stage 1 are transformed into sets of questions by a sequence of well-defined syntactic and lexical transformations (subject-auxiliary inversion, WH-movement, etc.). • The transducer identifies the answer phrases which may be targets for WH-movement and converts them into question phrases. • Pipeline: Declarative Sentence → Mark Unmovable Phrases → Generate Possible Question Phrase* → (Decompose Main Verb) → (Invert Subject and Auxiliary) → Insert Question Phrase → Perform Post-processing → Question

Stage 2 Question Transducer • In English, various constraints determine whether phrases can be involved in WH-movement and other phenomena involving long distance dependencies. • For example, noun phrases are “islands” to movement, meaning that constituents dominated by a noun phrase typically cannot undergo WH-movement. • John liked the book that I gave him. → What did John like? → *Who did John like the book that gave him?

Stage 2 Question Transducer • After marking unmovable phrases, we iteratively remove each possible answer phrase. • The question phrases for a given answer phrase consist of a question word (e.g., who, what, where, when), possibly preceded by a preposition and, in the case of question phrases like whose car, followed by the head of the answer phrase.

Stage 2 Question Transducer • The system annotates the source sentence with a set of entity types taken from the BBN Identifinder Text Suite and uses them to generate a final question. • The set of labels from BBN includes those used in standard named entity recognition tasks (e.g., “PERSON,” “ORGANIZATION”) and their corresponding types for common nouns (e.g., “PER DESC,” “ORG DESC”).

Stage 2 Question Transducer • The BBN label set also includes dates, times, monetary units, and others. • For a given answer phrase, the system uses the phrase’s entity labels and syntactic structure to generate a set of zero or more possible question phrases, each of which is used to generate a final question sentence.

Stage 2 Question Transducer • In order to perform subject-auxiliary inversion: – If an auxiliary verb or modal is not present, the question transducer decomposes the main verb into the appropriate form of do and the base form of the main verb. John saw Mary. → John did see Mary. → Who did John see? – If an auxiliary verb is already present, however, this decomposition is not necessary. John has seen Mary. → Who has John seen?
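The decomposition and inversion on this slide can be illustrated with a minimal sketch. The names `DO_FORMS`, `decompose_main_verb`, and `object_question` are hypothetical, and the sketch works on plain strings with hand-supplied part-of-speech tags, whereas the system described in the slides operates on parse trees.

```python
# Hypothetical table: Penn Treebank tag of the main verb -> matching form of "do".
DO_FORMS = {"VBD": "did", "VBZ": "does", "VBP": "do"}


def decompose_main_verb(lemma, pos):
    """Split a main verb into (form of 'do', base form), e.g. saw -> (did, see)."""
    return DO_FORMS[pos], lemma


def object_question(subject, auxiliary, verb, wh_word="Who"):
    """Place the auxiliary before the subject and front the question word."""
    return f"{wh_word} {auxiliary} {subject} {verb}?"


# No auxiliary present: "John saw Mary." -> "John did see Mary." -> question.
aux, base = decompose_main_verb("see", "VBD")
print(object_question("John", aux, base))      # Who did John see?

# Auxiliary already present: "John has seen Mary." needs no decomposition.
print(object_question("John", "has", "seen"))  # Who has John seen?
```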

Stage 2 Question Transducer • In order to convert between lemmas of verbs and the different surface forms that correspond to different parts of speech, we created a map from pairs of verb lemma and part of speech to verb surface forms. • We extracted all verbs and their parts of speech from the Penn Treebank. • We lemmatized each verb first by checking morphological variants in WordNet, and if a lemma was not found, then trimming the rightmost characters from the verb one at a time until a matching entry in WordNet was found.
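A rough sketch of the lemmatize-then-map procedure follows. It assumes two tiny hand-written stand-ins, `VARIANTS` and `LEXICON`, in place of WordNet, and a short hand-tagged verb list in place of the Penn Treebank; all names here are hypothetical.

```python
# Stand-ins for WordNet (assumptions for illustration only).
VARIANTS = {"saw": "see", "gave": "give"}     # known morphological variants
LEXICON = {"see", "give", "like", "walk"}     # known lemma entries


def lemmatize(verb):
    """Look up a variant first; otherwise trim rightmost characters until a hit."""
    verb = verb.lower()
    if verb in VARIANTS:
        return VARIANTS[verb]
    candidate = verb
    while candidate:
        if candidate in LEXICON:
            return candidate
        candidate = candidate[:-1]
    return verb  # nothing matched: fall back to the surface form


def build_surface_map(tagged_verbs):
    """Map (lemma, part of speech) -> first surface form seen for that pair."""
    surface = {}
    for form, pos in tagged_verbs:
        surface.setdefault((lemmatize(form), pos), form.lower())
    return surface


tagged = [("saw", "VBD"), ("seen", "VBN"), ("sees", "VBZ"), ("walked", "VBD")]
surface_map = build_surface_map(tagged)
print(surface_map[("see", "VBN")])   # seen
print(surface_map[("walk", "VBD")])  # walked
```

Trimming alone cannot recover lemmas whose final letters change (for example, liking never reduces to like), which is why the variant lookup comes first.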

Stage 2 Question Transducer • The transducer performs subject-auxiliary inversion either when the question to be generated is a yes-no question or when the answer phrase is a non-subject noun phrase. • Each possible question phrase is inserted into a copy of the tree to produce a question.

Stage 2 Question Transducer • Sentence-final periods are changed to question marks. • The output of our system showed that nearly all of the questions including pronouns were too vague (e.g., What does it have as a head of state?). • Therefore, we filter out all questions with personal pronouns, possessive pronouns, and noun phrases consisting solely of determiners (e.g., those).
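The post-processing step can be approximated at the token level, as in the sketch below. The word lists and function names are assumptions for illustration; a faithful version would use part-of-speech tags and the parse tree. Filtering of determiner-only noun phrases is left out here because it needs a parse.

```python
import re

# Hypothetical word lists; a tagger-based filter would be more reliable.
PERSONAL = {"i", "me", "you", "he", "him", "she", "her", "it",
            "we", "us", "they", "them"}
POSSESSIVE = {"my", "mine", "your", "yours", "his", "hers", "its",
              "our", "ours", "their", "theirs"}
FILTERED = PERSONAL | POSSESSIVE


def to_question_punctuation(sentence):
    """Change a sentence-final period to a question mark."""
    sentence = sentence.rstrip()
    if sentence.endswith("."):
        return sentence[:-1] + "?"
    return sentence


def contains_pronoun(question):
    """Crude token-level check for personal or possessive pronouns."""
    tokens = re.findall(r"[a-z]+", question.lower())
    return any(token in FILTERED for token in tokens)


def post_process(candidates):
    """Fix punctuation, then drop candidates that contain a pronoun."""
    questions = [to_question_punctuation(c) for c in candidates]
    return [q for q in questions if not contains_pronoun(q)]


print(post_process(["What does it have as a head of state.",
                    "What did John like."]))
# ['What did John like?']
```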

Stage 3 Question Ranker • Different sentences, and different transformations of source sentences, may be more or less likely to lead to high-quality questions, so each generated question is scored and ranked. • Fifteen native English-speaking university students rated a set of questions produced from stages 1 and 2. • For a predefined training set, each question was rated by a single annotator (not the same for each question), leading to a large number of diverse examples.

Stage 3 Question Ranker • For the test set, each question was rated by three people (again, not the same for each question) to provide a more reliable gold standard. • An inter-rater agreement of Fleiss’s κ = 0.42 was computed from the test set’s acceptability ratings. • Data by source (training set; testing set): English Wikipedia (1328/12; 120/2), Simple English Wiki (1195/16; 118/2), Wall Street Journal (284/8; 190/2), Total (2807/36; 428/6)
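For reference, the agreement statistic can be computed as in the sketch below, which follows the standard definition of Fleiss's kappa. The rating matrix is invented toy data, not the study's ratings, and the function name is an assumption.

```python
def fleiss_kappa(counts):
    """counts[i][j] = number of raters who put item i in category j.

    Every item must be rated by the same number of raters.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])

    # Share of all ratings that fall in each category.
    p_category = [sum(row[j] for row in counts) / (n_items * n_raters)
                  for j in range(n_categories)]
    # Observed agreement for each item.
    p_item = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
              for row in counts]

    observed = sum(p_item) / n_items
    expected = sum(p * p for p in p_category)
    return (observed - expected) / (1 - expected)


# Toy example: 4 questions, 3 raters each, columns = [acceptable, unacceptable].
ratings = [[3, 0], [0, 3], [2, 1], [1, 2]]
print(round(fleiss_kappa(ratings), 3))  # 0.333
```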

Ranking • Why do we over-generate and rank questions? – Named entity recognition errors – Parsing errors – Transformation errors • Therefore, we use a discriminative ranker, specifically one based on a logistic regression model that defines a probability of acceptability. M. Collins. 2000. Discriminative reranking for natural language parsing. In Proc. of ICML.
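The idea of ranking by a logistic regression estimate of acceptability can be sketched as follows. This is a minimal illustration, not the authors' implementation: the training loop (stochastic gradient descent with an L2 penalty), the hyperparameters, the two toy features, and all names are assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, features):
    """Estimated P(acceptable | features); weights[0] is the bias term."""
    z = weights[0] + sum(w * x for w, x in zip(weights[1:], features))
    return sigmoid(z)


def train(examples, labels, lr=0.1, epochs=500, l2=0.01):
    """Fit logistic regression by stochastic gradient descent."""
    weights = [0.0] * (len(examples[0]) + 1)
    for _ in range(epochs):
        for x, y in zip(examples, labels):
            error = predict(weights, x) - y
            weights[0] -= lr * error
            for j, xj in enumerate(x):
                weights[j + 1] -= lr * (error * xj + l2 * weights[j + 1])
    return weights


def rank(weights, candidates):
    """Sort (question, features) pairs by estimated acceptability, best first."""
    return sorted(candidates, key=lambda c: predict(weights, c[1]), reverse=True)


# Toy data: features = [scaled question length, has negation]; 1 = acceptable.
train_x = [[0.25, 0], [0.30, 0], [0.90, 1], [0.80, 1], [0.35, 0], [0.85, 0]]
train_y = [1, 1, 0, 0, 1, 0]
weights = train(train_x, train_y)

candidates = [("long negated question", [0.90, 1]),
              ("short question", [0.25, 0])]
print([question for question, _ in rank(weights, candidates)])
# ['short question', 'long negated question']
```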

Feature Set • Length (integer): the numbers of tokens in the question, the source sentence, and the answer phrase from which the WH phrase was generated • Negation (boolean): the presence of not, never, or no in the question • N-Gram Language Model (real value): the log likelihoods and length-normalized log likelihoods of the question, the source sentence, and the answer phrase

Feature Set (continued) • Grammatical (integer): the numbers of proper nouns, pronouns, adjectives, adverbs, conjunctions, numbers, noun phrases, prepositional phrases, and subordinate clauses in the phrase structure parse trees for the question and answer phrase • Transformations (binary): the possible syntactic transformations (e.g., removal of appositives and parentheticals, choosing the subject of source sentence as the answer phrase) • Vagueness (integer): the numbers of noun phrases in the question, source sentence, and answer phrase that are potentially vague
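Two of the feature types above, length and negation, need only tokenized text and can be sketched directly. The tokenizer and feature names below are assumptions; the language model, grammatical, transformation, and vagueness features require a language model or parser and are not shown.

```python
import re

NEGATION_WORDS = {"not", "never", "no"}


def tokenize(text):
    """Hypothetical tokenizer: lowercase words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def surface_features(question, source, answer):
    """Length features (integers) and the negation feature (boolean)."""
    question_tokens = tokenize(question)
    return {
        "len_question": len(question_tokens),
        "len_source": len(tokenize(source)),
        "len_answer": len(tokenize(answer)),
        "has_negation": any(t in NEGATION_WORDS for t in question_tokens),
    }


features = surface_features(
    question="What did John like?",
    source="John liked the book that I gave him.",
    answer="the book that I gave him",
)
print(features)
# {'len_question': 5, 'len_source': 9, 'len_answer': 6, 'has_negation': False}
```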

Evaluation • We report the results of experiments that evaluate the quality of generated questions before and after ranking. • The evaluation metric we employ is the percentage of test set questions labeled as acceptable. • For rankings, our metric is the percentage of the top N% labeled as acceptable, for various N.
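The ranking metric can be made concrete with a short sketch. The function name, the choice to round the cutoff up, and the label list are assumptions for illustration; the labels are toy data, not results from the paper.

```python
import math


def percent_acceptable_in_top(ranked_labels, n_percent):
    """Percentage of acceptable questions among the top N% of a ranked list.

    ranked_labels: 1 (acceptable) or 0 (unacceptable), best-ranked first.
    """
    cutoff = max(1, math.ceil(len(ranked_labels) * n_percent / 100))
    top = ranked_labels[:cutoff]
    return 100.0 * sum(top) / len(top)


# Toy ranking of ten questions.
labels = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]
print(percent_acceptable_in_top(labels, 20))   # 100.0
print(percent_acceptable_in_top(labels, 50))   # 60.0
print(percent_acceptable_in_top(labels, 100))  # 40.0
```

With N = 100 the metric reduces to the unranked percentage of acceptable questions, so the two evaluations are directly comparable.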

Results for Unranked Questions • 27.3% of test set questions were labeled acceptable (i.e., having no deficiencies) by a majority of raters.

Results for Ranking

Ablation Result

Recall

Online Demo