Application of NLP in Information Retrieval

Presentation Outline
- Overview of current IR systems
- Problems with NLP in IR
- Major applications of NLP in IR

Motivation
- The most successful general-purpose retrieval methods are statistical.
- Sophisticated linguistic processing often degrades retrieval performance.

What is IR?
- "[An] information retrieval system is one that searches a collection of natural language documents with the goal of retrieving exactly the set of documents that pertain to a user's question."
- IR systems have their origins in library systems.
- They do not attempt to deduce or generate answers.

The problem of IR
- Goal: find documents relevant to an information need from a large document set.
[Diagram labels: information need, query, document collection, retrieval, IR system, answer list]

Basics of IR Systems

Basics of IR Systems (contd…)
- Index the collection of documents.
- Transform the query into the same representation used for the document content.
- Compare the description of each document with that of the query.
- List the results in order of relevance.
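The four steps can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the helper names (`to_terms`, `search`), the sample documents, and the choice of "number of shared terms" as the comparison are hypothetical and not taken from the slides.

```python
def to_terms(text):
    """Represent a text as a set of lowercase word tokens (assumed, minimal representation)."""
    return set(text.lower().split())


def search(query, documents):
    """Return (doc_id, score) pairs ordered by the number of terms shared with the query."""
    # Step 1: index the collection of documents.
    index = {doc_id: to_terms(text) for doc_id, text in documents.items()}
    # Step 2: transform the query the same way as the documents.
    query_terms = to_terms(query)
    # Step 3: compare each document description with the query.
    scores = {doc_id: len(terms & query_terms) for doc_id, terms in index.items()}
    # Step 4: list the results in order of relevance, dropping non-matching documents.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [(doc_id, score) for doc_id, score in ranked if score > 0]


# Hypothetical toy collection.
docs = {
    "d1": "audi a8 gas mileage figures",
    "d2": "history of the bicycle",
    "d3": "gas prices and mileage tips",
}
results = search("audi gas mileage", docs)
print(results)
```

Real systems replace the overlap count with a weighted similarity score, but the index, transform, compare, rank structure stays the same.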

Basics of IR Systems (contd…)
- Retrieval systems consist mainly of two processes:
  - Indexing
  - Matching

Indexing
- Indexing is the process of selecting terms to represent a text.
- Indexing involves:
  - Tokenizing the string
  - Removing frequent words
  - Stemming (removing suffixes such as -ing, -ed, etc.)
- Two common indexing techniques:
  - Boolean model
  - Vector space model
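A minimal sketch of the three indexing steps follows. The stop-word list and the suffix-stripping rule are assumptions made for illustration; in particular, `crude_stem` is a toy rule and not the Porter stemmer or any other published algorithm.

```python
import re

# Hypothetical stop-word list; real systems use much longer ones.
STOP_WORDS = {"a", "an", "the", "is", "of", "to", "for", "my", "i", "and", "in"}
# Suffixes tried in this order by the toy stemmer.
SUFFIXES = ("ing", "ed", "s")


def tokenize(text):
    """Split a string into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def crude_stem(token):
    """Strip one suffix if at least three characters remain (toy rule, not Porter)."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def index_terms(text):
    """Tokenize, drop frequent words, then stem what is left."""
    tokens = tokenize(text)
    kept = [t for t in tokens if t not in STOP_WORDS]
    return [crude_stem(t) for t in kept]


terms = index_terms("Searching indexed documents for the retrieved terms")
print(terms)
```

Note that the toy stemmer produces non-words such as "retriev"; that is acceptable for indexing as long as queries are stemmed the same way.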

Indexing

Information Retrieval Models
- A retrieval model consists of:
  - D: a representation for documents
  - Q: a representation for queries
  - F: a modeling framework for D and Q
  - R(q, di): a ranking or similarity function that orders the documents with respect to a query
- Here, tokens are treated as 1s and 0s.

Boolean Model
- Queries are represented as Boolean combinations of terms.
- The set of documents that satisfy the Boolean expression is retrieved in response to the query.
- Drawback:
  - The user is given no indication of whether some documents in the retrieved set are likely to be better than others.
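Boolean retrieval maps naturally onto set operations over an inverted index. The sketch below assumes whitespace tokenization and a hypothetical three-document collection; the function names are illustrative only.

```python
def build_inverted_index(documents):
    """Map each term to the set of document ids that contain it."""
    index = {}
    for doc_id, text in documents.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index


# Hypothetical toy collection.
docs = {
    "d1": "audi a8 gas mileage",
    "d2": "audi a8 price",
    "d3": "gas mileage tips",
}
index = build_inverted_index(docs)
all_docs = set(docs)


def postings(term):
    """Documents containing the term (empty set if the term is unknown)."""
    return index.get(term, set())


# AND is set intersection, OR is union, NOT is complement against the collection.
and_result = postings("audi") & postings("mileage")
or_result = postings("audi") | postings("mileage")
not_result = postings("gas") & (all_docs - postings("audi"))
print(sorted(and_result), sorted(or_result), sorted(not_result))
```

The results are plain sets, which makes the drawback concrete: every document either matches or does not, and nothing orders the matches.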

Vector Space Model
- Documents and queries are represented by vectors in a T-dimensional space, where T is the number of distinct terms used in the documents.
- Each axis corresponds to one term.
- The result is a ranked list of documents ordered by similarity to the query, where the similarity between a query and a document is computed using a metric on the respective vectors.
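As an illustration, the sketch below builds raw term-count vectors and ranks documents by cosine similarity. The slides do not name a specific metric; cosine similarity is assumed here because it is a common choice, and the toy collection and function names are hypothetical.

```python
import math


def to_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in the text (one axis per term)."""
    tokens = text.lower().split()
    return [tokens.count(term) for term in vocabulary]


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy collection.
docs = {
    "d1": "audi a8 gas mileage",
    "d2": "audi a8 price",
    "d3": "gas mileage tips",
}
# T distinct terms define the T axes of the space.
vocabulary = sorted({t for text in docs.values() for t in text.lower().split()})
query_vec = to_vector("audi gas mileage", vocabulary)

# Rank document ids by similarity to the query, most similar first.
ranking = sorted(
    docs,
    key=lambda d: cosine(query_vec, to_vector(docs[d], vocabulary)),
    reverse=True,
)
print(len(vocabulary), ranking)
```

Unlike the Boolean model, every document receives a graded score, so partial matches such as d3 are ranked rather than rejected.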

Matching
- Matching is the process of computing a measure of similarity between two text representations.
- The relevance of a document is computed from the following parameters:
  - tf (term frequency): the number of times a given term appears in a document, normalized by the document length.
    tf_{i,j} = (count of the i-th term in the j-th document) / (total terms in the j-th document)
  - idf (inverse document frequency): a measure of the general importance of the term.
    idf_i = (total no. of documents) / (no. of documents containing the i-th term)
    In practice the logarithm of this ratio is commonly used.
  - tf-idf score: tfidf_{i,j} = tf_{i,j} * idf_i
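The tf, idf and tf-idf definitions can be worked through on a toy corpus. The sketch follows the ratio form of idf given on the slide; the `use_log` flag is an added assumption that switches to the log-scaled variant commonly used in practice. The corpus and function names are hypothetical.

```python
import math


def tf(term, doc_tokens):
    """Term count divided by document length (0.0 for an empty document)."""
    if not doc_tokens:
        return 0.0
    return doc_tokens.count(term) / len(doc_tokens)


def idf(term, corpus, use_log=False):
    """Total documents divided by documents containing the term, optionally log-scaled."""
    containing = sum(1 for doc_tokens in corpus if term in doc_tokens)
    if containing == 0:
        return 0.0  # Term never occurs; treat it as carrying no weight.
    ratio = len(corpus) / containing
    return math.log(ratio) if use_log else ratio


def tfidf(term, doc_tokens, corpus, use_log=False):
    """tf-idf score of a term in one document of the corpus."""
    return tf(term, doc_tokens) * idf(term, corpus, use_log)


# Hypothetical toy corpus, already tokenized.
corpus = [
    "audi a8 gas mileage".split(),
    "audi a8 price".split(),
    "gas mileage tips".split(),
]
# "audi" in the first document: tf = 1/4, idf = 3/2.
score = tfidf("audi", corpus[0], corpus)
print(score)
```

A term that appears in every document gets idf = 1 under the ratio form and idf = 0 under the log form, which is one reason the log variant is popular: it removes ubiquitous terms from the score entirely.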

Evaluation of IR Systems
- Two common effectiveness measures:
  - Precision: the proportion of retrieved documents that are relevant (similar in spirit to accuracy).
    Precision = (no. of retrieved relevant documents) / (total no. of retrieved documents)
  - Recall: the proportion of relevant documents that are retrieved.
    Recall = (no. of retrieved relevant documents) / (total no. of relevant documents)
- Ideally both precision and recall should be 1.
- In practice they tend to be inversely related: raising one usually lowers the other.
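Both measures follow directly from their verbal definitions: precision divides the relevant hits by what was retrieved, and recall divides them by what is relevant. The sketch below uses hypothetical document ids and a hypothetical function name.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved list against a set of relevant documents."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # Retrieved documents that are also relevant.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Four documents retrieved, three relevant overall, two in common.
precision, recall = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d7"])
print(precision, recall)
```

Retrieving the whole collection would push recall to 1 while precision collapses, which illustrates the trade-off between the two.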

Case Study
Query: I need to know the gas mileage for my audi a 8 2004 model
Source: Yahoo search (search.yahoo.com)

Case Study (contd…)
Query: I need to know the gas mileage for my audi a 8 2004 model
Source: Y!Q search (yq.search.yahoo.com)

Case Study (contd…)
Query: I need to know the gas mileage for my audi a 8 2004 model
Source: Google search (www.google.com)

Case Study (contd…)
- Yahoo Search
  - Pure text-based search.
  - Results are documents containing instances of the same text as the query.
- Y!Q Search
  - Uses semantics, but not efficiently.
  - Attempts to generate an answer, though it does so less efficiently.
- Google Search
  - Efficient use of NLP to deduce the answer from the given question.
  - A step towards question answering!

Conclusion
- Research efforts to address appropriate tasks are underway, e.g. document summarization and answer generation.
- Achieving extremely efficient NLP techniques is an idealization.

References
- Voorhees, E. M., "Natural Language Processing and Information Retrieval," in Pazienza, M. T. (ed.), Information Extraction: Towards Scalable, Adaptable Systems, New York: Springer, 1999.
- Salton, G., Wong, A., and Yang, C. S., "A Vector Space Model for Automatic Indexing," Communications of the ACM, 1975, 613-620.
- Vallez, M. and Pedraza-Jimenez, R., "Natural Language Processing in Textual Information Retrieval and Related Topics," Hipertext.net, num. 5, 2007.
- Khaitan, S., Verma, K., and Bhattacharyya, P., "Exploiting Semantic Proximity for Information Retrieval," IJCAI 2007 Workshop on Cross Lingual Information Access, Hyderabad, India, Jan. 2007.
- Wikipedia

Questions?

Thank You!