SEARCHING QUESTION AND ANSWER ARCHIVES
Dr. Jiwoon Jeon
Presented by CHARANYA VENKATESH KUMAR

Discussion
- Current Information Retrieval systems?

OVERVIEW
- Introduction
- Q&A Retrieval
- Test Collections
- Translation Based Q&A retrieval framework
- Learning word-to-word translations

INTRODUCTION
- Q&A Retrieval problem
- Challenges
  - Semantically similar questions
    - Problem: Word mismatch problem
    - Solution: Machine translation-based information retrieval model
  - Quality of the Answers
    - Problem: Many answers to a given question
    - Solution: Answer Quality Prediction Technique

What is New?
- New Type of Information System
- New Translation-based Retrieval Model
- New Document Quality Estimation Method
- Integration of Advances in Multiple Research Areas
- New Paraphrase Generation Method
- Utilizing Web as a Resource for Retrieval

Q & A RETRIEVAL
- Question & Answer Archives
  - Websites with FAQ
  - Community based question answering services
- Task Definition

Q & A Retrieval (Contd.)
- Advantages
  - Handle natural language questions
  - Return answers instead of relevant documents
- Disadvantages
  - Can answer only previously answered questions

Q & A RETRIEVAL SYSTEM ARCHITECTURE

CHALLENGES
- Finding relevant Question & Answer Pairs
  - Importance of question parts
  - Word mismatch problem
- Estimating Answer Quality
  - Importance

TEST COLLECTIONS
- Components:
  - Set of documents
  - Set of information needs (queries)
  - Set of relevance judgments
- Pooling Method
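
The pooling method can be illustrated with a small sketch. This assumes the common form of pooling, in which the top-ranked results of several retrieval systems are merged and only the merged pool is judged for relevance. The function name `build_pool`, the toy runs, and the pool depth are hypothetical and are not taken from the slides.

```python
def build_pool(ranked_runs, depth):
    """Return the union of the top-`depth` documents of every ranked run.

    Only documents in this pool are shown to assessors; anything outside
    the pool is treated as non-relevant.
    """
    pool = set()
    for run in ranked_runs:
        pool.update(run[:depth])
    return pool


# Three hypothetical systems, each returning a ranked list of Q&A pair ids.
runs = [
    ["d3", "d1", "d7", "d2"],
    ["d1", "d4", "d3", "d9"],
    ["d8", "d1", "d5", "d3"],
]
pool = build_pool(runs, depth=2)
print(sorted(pool))
```

A deeper pool finds more relevant pairs but costs more judging effort, so the depth is a trade-off chosen per collection.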

WONDIR COLLECTION
- Earliest community based QA service in the US.
- 1 million question and answer pairs used from this service
- Average question length = 27 words
- Average answer length = 28 words

Examples

Queries
- Closed-class questions that ask fact based short answers.
  - E.g.: Where is Charlotte located?
- Relevance Judgment
  - 220 relevant Q&A pairs for 50 queries using pooling method.
  - Relevance Judgment Criteria

WebFAQ COLLECTION by Jijkoun and de Rijke
- Collection of FAQs gathered using web crawlers and made public for research purposes.
- Found web pages that contain the word “FAQ”.
- Used heuristic methods to automatically extract question and answer pairs from the web pages.

NAVER COLLECTION
- Leading portal site in South Korea
- Community-based answering service
- Collection A:
  - Category information: to test category specific translations
- Collection B:
  - Non-Textual Information: to build answer quality prediction technique

Naver Collection (Contd.)
- Question: Title & Body
- Naver Test Collection A
- Naver Test Collection B
- Relevance:
  - Question semantically related to query, and
  - Question contains all query terms
  - Q&A pair was clicked multiple times for the query.

Comparison of test Collections

Translation Based Q&A Retrieval framework
- Use of Machine Translation technique for information retrieval
- Word mismatch problem
  - Translation based approach

IBM Statistical Machine Translation Models
- Do not require any linguistic knowledge of the source or target language.
- Exploit only co-occurrence statistics of terms in training data.

IBM Models
- Model 1
  - Treats every possible word alignment equally
- Model 2
  - Assumes only positions of terms are related to the word alignment
- Model 3
  - The first term and the second term generated from the same term are independent

IBM Models (Contd.)
- Model 4
  - First order alignment model
  - Every word is dependent only on the previous aligned word.
- Model 5
  - Reformulation of Model 4

Advantages of Model 1
- Efficient implementation is possible using a form of query expansion.
- Performance gain of using low level translation models is high.
- Can be easily integrated into the query likelihood model.

IBM Model 1 Equation
- The probability that a query Q of length m is the translation of a document D (of length n) is given as
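
The equation on this slide did not survive extraction. As a hedged sketch, the code below uses the standard textbook form of IBM Model 1, P(Q|D) = epsilon / (n+1)^m * prod_{i=1..m} sum_{j=0..n} t(q_i | d_j), where d_0 is a null word. The exact notation on the original slide may differ. The function name, the `"<null>"` token, and the toy translation table are assumptions made for illustration.

```python
def model1_probability(query, doc, trans, epsilon=1.0):
    """Standard IBM Model 1 probability that `query` is a translation of `doc`.

    trans maps (target_word, source_word) -> t(target | source).
    A null source word is prepended to the document, so the document
    contributes n + 1 possible alignment positions.
    """
    source = ["<null>"] + list(doc)
    n, m = len(doc), len(query)
    prob = epsilon / (n + 1) ** m
    for q in query:
        # Every alignment is equally likely, so sum over all source positions.
        prob *= sum(trans.get((q, d), 0.0) for d in source)
    return prob


# Hypothetical translation probabilities for a one-word query.
toy_trans = {("car", "auto"): 0.6, ("car", "repair"): 0.1}
p = model1_probability(["car"], ["auto", "repair"], toy_trans)
print(p)
```

The query word "car" never occurs in the document, yet the probability is non-zero because "auto" can translate to it. This is how the model addresses the word mismatch problem.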

IBM Model 1 Equation

Translation based Language Models
- Language model is a mechanism for generating text.
- Unigram language model
  - Assumes each word is generated independently
  - Concerns only probabilities of sampling a single word.

Language modeling approach to IR
- Under the maximum likelihood estimator, unseen words in a document have zero probability.
- Smoothing:
  - Transfers some probability mass from the seen words to the unseen words.
  - Dirichlet smoothing: good performance and cheap computational cost.

Language modeling approach to IR (Contd.)
- The ranking function for the query likelihood language model with Dirichlet smoothing can be written as
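
The ranking function itself is missing from the extracted slide. A minimal sketch, assuming the usual Dirichlet-smoothed estimate P(w|D) = (tf(w,D) + mu * P(w|C)) / (|D| + mu) and a log-space product over query words, is given below. The function name, the default `mu`, and the toy collection are illustrative assumptions.

```python
import math
from collections import Counter


def dirichlet_query_likelihood(query, doc, collection_tf, collection_len, mu=2000.0):
    """Log query likelihood of `doc` with Dirichlet smoothing (a sketch)."""
    tf = Counter(doc)
    score = 0.0
    for w in query:
        p_coll = collection_tf.get(w, 0) / collection_len
        p = (tf[w] + mu * p_coll) / (len(doc) + mu)
        if p == 0.0:
            # The word is unseen in the whole collection.
            return float("-inf")
        score += math.log(p)
    return score


doc1 = ["charlotte", "is", "in", "north", "carolina"]
doc2 = ["raleigh", "is", "the", "capital"]
collection_tf = Counter(doc1 + doc2)
collection_len = len(doc1) + len(doc2)

s1 = dirichlet_query_likelihood(["charlotte"], doc1, collection_tf, collection_len)
s2 = dirichlet_query_likelihood(["charlotte"], doc2, collection_tf, collection_len)
print(s1, s2)
```

Smoothing keeps the score of `doc2` finite even though it never mentions the query word, while `doc1` still ranks higher.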

IBM Model 1 vs. Query Likelihood
- Comparable components in the two models

Self Translation Model
- Every word has some probability to translate to itself.
- This probability cannot be 1.
- If it is too low, retrieval performance deteriorates.

TransLM
- Final ranking function looks like
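
The final ranking function is not recoverable from the extracted text. The sketch below shows one plausible form of a translation-based language model, assuming the maximum likelihood document estimate in the Dirichlet-smoothed query likelihood is replaced by a translation sum: P(w|D) = |D|/(|D|+mu) * sum_t T(w|t) P_ml(t|D) + mu/(|D|+mu) * P_ml(w|C). Self-translation is assumed to be stored in the table as T(w|w). All names and numbers are hypothetical and may not match the slide.

```python
import math
from collections import Counter


def translm_score(query, doc, trans, collection_tf, collection_len, mu=2000.0):
    """Log score of `doc` under a translation-based language model (a sketch).

    trans maps (target_word, source_word) -> T(target | source).
    """
    tf = Counter(doc)
    dlen = len(doc)
    lam = dlen / (dlen + mu)
    score = 0.0
    for w in query:
        # Probability of generating w by translating some document word t.
        p_trans = 0.0
        if dlen > 0:
            p_trans = sum(trans.get((w, t), 0.0) * c / dlen for t, c in tf.items())
        p_coll = collection_tf.get(w, 0) / collection_len
        p = lam * p_trans + (1.0 - lam) * p_coll
        if p == 0.0:
            return float("-inf")
        score += math.log(p)
    return score


# Hypothetical table: "auto" translates to itself and, less often, to "car".
toy_table = {("auto", "auto"): 0.7, ("car", "auto"): 0.3, ("car", "car"): 0.7}
coll_tf = {"car": 1, "auto": 1, "repair": 1, "flower": 1, "shop": 1}
score_a = translm_score(["car"], ["auto", "repair"], toy_table, coll_tf, 5)
score_b = translm_score(["car"], ["flower", "shop"], toy_table, coll_tf, 5)
print(score_a, score_b)
```

Plain query likelihood would score both toy documents identically for the query "car", since neither contains the word. The translation term is what separates them.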

Efficiency Issues and Implementation of TransLM
- Flipped Translation Tables
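
A flipped translation table can be sketched as follows, assuming the original table is indexed by source word and that flipping means re-indexing it by target word. At query time a single lookup on the query term then yields every document term that can translate into it, which works like query expansion. The function name and the toy table are hypothetical.

```python
from collections import defaultdict


def flip_translation_table(by_source):
    """Re-index {source: {target: prob}} as {target: {source: prob}}."""
    by_target = defaultdict(dict)
    for source, targets in by_source.items():
        for target, prob in targets.items():
            by_target[target][source] = prob
    return dict(by_target)


# Hypothetical table indexed by source (document) word.
by_source = {
    "auto": {"auto": 0.7, "car": 0.3},
    "vehicle": {"vehicle": 0.8, "car": 0.2},
}
flipped = flip_translation_table(by_source)
print(flipped["car"])
```

The probabilities are unchanged by flipping. Only the lookup direction differs, so the values under one target word need not sum to 1.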

Term-at-a-time Algorithm

Properties of Word Relationships
- Not symmetric
- Not fixed
  - Change depending on retrieval or translation tasks.
- Must be given as probability values.

Training Sample Generation
- Key Idea
  - If two answers are very similar, then the corresponding questions are semantically similar.
- Similarity Measures
  - Cosine Similarity
  - Query Likelihood scores between two answers (LM SCORE)
  - LM-HRANK
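
The key idea can be sketched with the first of the listed measures, cosine similarity. The code below pairs up questions whose answers are similar above a threshold. The whitespace tokenization, the threshold value, and the toy Q&A pairs are assumptions for illustration only; LM SCORE and LM-HRANK are not shown.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a_tokens, b_tokens):
    """Cosine similarity between two bags of words using raw term counts."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def similar_question_pairs(qa_pairs, threshold):
    """Return question pairs whose answers have cosine similarity >= threshold."""
    pairs = []
    for (q1, a1), (q2, a2) in combinations(qa_pairs, 2):
        if cosine(a1.lower().split(), a2.lower().split()) >= threshold:
            pairs.append((q1, q2))
    return pairs


qa_pairs = [
    ("How do I reset my router?",
     "unplug the router wait ten seconds and plug it back in"),
    ("My wifi box is frozen, what now?",
     "unplug the router wait ten seconds then plug it back in"),
    ("Where is Charlotte located?",
     "charlotte is in north carolina"),
]
training_pairs = similar_question_pairs(qa_pairs, threshold=0.8)
print(training_pairs)
```

The two router questions share almost no words, yet they are paired because their answers nearly coincide. Pairs like this are the kind of training sample from which word relationships can be learned.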

Word Relationship Types
- P(Q|A)
  - Source: Answer; Target: Question
- P(A|Q)
  - Source: Question; Target: Answer
- P(Q|Q)
- P(Q<->Q)

EM Algorithm
- Find word relationships that maximize the likelihood of sampling the target text from the source text in training samples.

EM Algorithm (Contd.)
- The translation probability from a source word t to a target word w is given as
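
The update equation is missing from the extracted slide. As a hedged illustration, the sketch below runs the standard EM procedure for IBM Model 1: the E-step distributes each target word's count over the source words in proportion to the current translation probabilities, and the M-step renormalizes the expected counts per source word. The null word is omitted for brevity, and the function name and toy training pairs are hypothetical.

```python
from collections import defaultdict


def train_model1(pairs, iterations=10):
    """Estimate t(target | source) with EM over (source_tokens, target_tokens) pairs."""
    target_vocab = {w for _, tgt in pairs for w in tgt}
    uniform = 1.0 / len(target_vocab)
    t = defaultdict(lambda: uniform)  # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (target, source)
        total = defaultdict(float)  # expected count of source
        for src, tgt in pairs:
            for w in tgt:
                # E-step: posterior that w was generated by each source word.
                z = sum(t[(w, s)] for s in src)
                for s in src:
                    c = t[(w, s)] / z
                    count[(w, s)] += c
                    total[s] += c
        # M-step: renormalize so that sum_w t(w | s) = 1 for every source s.
        t = defaultdict(float, {(w, s): c / total[s] for (w, s), c in count.items()})
    return t


# Hypothetical training samples: (source text, target text).
samples = [
    (["auto", "shop"], ["car", "store"]),
    (["auto", "loan"], ["car", "credit"]),
    (["home", "loan"], ["house", "credit"]),
]
table = train_model1(samples)
print(round(table[("car", "auto")], 3), round(table[("credit", "loan")], 3))
```

Because "auto" co-occurs with "car" in two samples and with "store" in only one, EM shifts probability mass toward t(car | auto) over the iterations.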

Examples

Examples (Contd.)

SUMMARY
- Introduction
- Q&A Retrieval
- Test Collections
- Translation Based Q&A retrieval framework
- Learning word-to-word translations

Coming Up Next…
- Estimating Answer Quality
- Experiments