Challenges in Information Retrieval and Language Modeling
Michael Shepherd
Dalhousie University
Halifax, NS, Canada

Report of a Workshop
The following presentation is based on: James Allan et al., “Challenges in Information Retrieval and Language Modeling”. Report of a workshop held at the Center for Intelligent Information Retrieval, University of Massachusetts Amherst, September 2002.

Long-Term Challenges
• LT Challenge 1 – Global Information Access
  – Satisfy human information needs through natural, efficient interaction with an automated system that leverages world-wide structured and unstructured data in any language
• Need
  – Massively distributed, multi-lingual retrieval systems
  – Techniques from distributed retrieval, data fusion, cross-lingual IR

Long-Term Challenges
• LT Challenge 2 – Contextual Retrieval
  – Combine search technologies and knowledge about query and user context into a single framework in order to provide the most “appropriate” answer for a user’s information needs
• Need
  – Context and query features to infer characteristics of the information need, such as query type, answer level, task, etc.

[Diagram labels: Query, Activity, User, Information Need, Task, User Profile]

Topics
1. Retrieval Models
2. Cross-Lingual Information Retrieval
3. Web Search
4. User Modeling
5. Filtering, Topic Detection & Tracking, and Classification
6. Summarization
7. Question Answering
8. Metasearch and Distributed Retrieval
9. Multimedia Retrieval
10. Information Extraction
11. Testbeds

User Modeling
• Much research over the past number of years has abstracted the user out of the retrieval problem
• But, in recent years, the rate of improvement of IR systems has slowed
• One reason may be that generic IR systems are “good-enough” for everyone but “never great” for anyone
• It is suggested that greater focus on the user will enable major advances in IR

How Do We Get Info About the User?
• a priori
  – Ask the user
• a posteriori
  – Explicit
    • Show the user a document and ask them if it was relevant
  – Implicit
    • Track what the user does
      – Web logs
      – Time spent reading a page

How Do We Model the User?
• IR Technique
  – A vector of terms or features supplied by the user or drawn from documents deemed relevant to the user
  – May be static or adaptive
• Machine Learning Technique
  – An adaptive technique such as a neural net that “learns” the preferences of the user
  – Feature set selection is important
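
As a minimal sketch of the IR technique above, the profile can be held as a term-weight vector built from documents the user judged relevant and nudged toward each newly judged document. The function names, the whitespace tokenizer, and the learning rate `rate` are assumptions made for illustration; the workshop report does not prescribe them.

```python
from collections import Counter
import math


def term_vector(text):
    """Return a raw term-frequency vector for a text (whitespace tokens)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def update_profile(profile, doc_vector, rate=0.2):
    """Adaptive step: move the profile toward a document judged relevant.

    rate is a hypothetical learning rate; rate=0 keeps the profile static.
    """
    terms = set(profile) | set(doc_vector)
    return {t: (1 - rate) * profile.get(t, 0.0) + rate * doc_vector.get(t, 0.0)
            for t in terms}


# Initial profile from terms supplied by the user, then one adaptive update.
profile = dict(term_vector("hockey scores hockey playoffs"))
relevant = term_vector("playoffs schedule hockey")
profile = update_profile(profile, relevant)
```

With `rate=0` the same code gives the static variant; larger values make the profile follow recent judgments more closely.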

User Model as Filter
[Diagram: Information need, Query representation, Document representation, Matching algorithm, User Model as Filter, results]
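
One way to read the "user model as filter" arrangement is that the matching algorithm runs on the query as usual and the user model then screens its output. The sketch below assumes the user model is a term-weight vector and that screening is a cosine cutoff; `filter_results` and `threshold` are hypothetical names, not taken from the slides.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def filter_results(results, profile, threshold=0.1):
    """Keep only matched documents that are similar enough to the profile.

    results: list of (doc_id, term_vector) pairs already returned by the
    matching algorithm, in ranked order. threshold is an assumed cutoff.
    """
    return [doc_id for doc_id, vec in results
            if cosine(vec, profile) >= threshold]


profile = {"hockey": 1.0, "playoffs": 0.5}
results = [
    ("d1", {"hockey": 2.0, "trade": 1.0}),
    ("d2", {"election": 1.0, "debate": 1.0}),
    ("d3", {"playoffs": 1.0}),
]
kept = filter_results(results, profile)  # original rank order is preserved
```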

User Model as Query
[Diagram: Information need, User Model as Query, Document representation, Matching algorithm, results]

Integrating the User Model and the Query
• User Profile Modified Query
• Moving the Query within the Document Space

Integrating the User Model and the Query
[Figure: document space showing the profile p, the query q, and the modified query q']
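
The slides do not say how q' is computed from p and q. A common choice, assumed here, is a Rocchio-style weighted sum q' = alpha * q + beta * p, which pulls the query toward the profile in the document space. The weights `alpha` and `beta` and the function name are illustrative only.

```python
def move_query(query, profile, alpha=1.0, beta=0.5):
    """Return q' = alpha * q + beta * p over sparse term vectors.

    alpha and beta are assumed weights: beta=0 leaves the query unchanged,
    and a larger beta moves q' closer to the profile p.
    """
    terms = set(query) | set(profile)
    return {t: alpha * query.get(t, 0.0) + beta * profile.get(t, 0.0)
            for t in terms}


q = {"jaguar": 1.0}                 # ambiguous query
p = {"car": 1.0, "engine": 0.5}     # profile of a user interested in cars
q_prime = move_query(q, p)
```

Here the modified query keeps its original term at full weight and picks up the profile terms at half their profile weight, so documents about cars rank above documents about animals for this user.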

Integrating the User Profile and the Query
[Figure: document space showing the profile p and the query q]

Short-term/Long-term Interests
• Users’ interests change over time
• May have short-term interests but we do not want these to skew our models away from our long-term interests
• Particular focus is electronic news
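
One way to keep a short-lived interest from skewing the long-term model, sketched here under assumptions, is to maintain two term-weight profiles that are updated from the same reading events at different rates. The class name and the two rates are hypothetical; the slides do not specify a mechanism.

```python
def decay_update(profile, doc_vector, rate):
    """Exponentially weighted update of a sparse term-weight profile."""
    terms = set(profile) | set(doc_vector)
    return {t: (1 - rate) * profile.get(t, 0.0) + rate * doc_vector.get(t, 0.0)
            for t in terms}


class InterestModel:
    """Two profiles fed by the same events, updated at different speeds."""

    def __init__(self, short_rate=0.5, long_rate=0.05):
        self.short_rate = short_rate   # assumed: reacts quickly, forgets quickly
        self.long_rate = long_rate     # assumed: reacts slowly, stays stable
        self.short_term = {}
        self.long_term = {}

    def observe(self, doc_vector):
        """Record one document the user read."""
        self.short_term = decay_update(self.short_term, doc_vector, self.short_rate)
        self.long_term = decay_update(self.long_term, doc_vector, self.long_rate)


model = InterestModel()
for _ in range(20):
    model.observe({"hockey": 1.0})    # sustained interest
model.observe({"election": 1.0})      # one-off news item
```

After the one-off item, the short-term profile weights it about as heavily as the sustained topic, while the long-term profile still clearly favors the sustained topic.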

Single task/Multiple tasks
• Most user models are built for a specific task, such as filtering news items looking for certain types of news
• Most people multi-task, so we currently run multiple user models for different tasks for the same user
• Really would like to have a single model for multiple tasks

Filtering, Topic Detection & Tracking and Classification
• Some of these technologies have been adopted widely
• These topics are grouped together because they are similar technologies used in similar applications

Routing of email and phone messages for Customer Relationship Management
[Diagram: Message Routing System with destinations Service Department, New Accounts, Customer Complaints]

Categorization of Trouble Tickets
[Diagram: Trouble Ticket Routing System with destinations Trouble Category 1, Trouble Category 2, Trouble Category 3]

Topic Detection
[Diagram: News Item Routing System with destinations Topic 1, Topic 2, Topic 3, New Topic]
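
The routing behavior suggested by this slide can be sketched as nearest-centroid assignment with a novelty threshold: an incoming news item goes to the most similar known topic, or is flagged as a new topic when nothing is similar enough. The cosine measure, the `threshold` value, and the function names are assumptions for illustration, not the method described in the workshop report.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def detect_topic(item, topics, threshold=0.3):
    """Return the best-matching topic name, or None to signal a new topic.

    topics maps a topic name to its centroid vector; threshold is an
    assumed minimum similarity for assigning an item to a known topic.
    """
    best_name, best_score = None, 0.0
    for name, centroid in topics.items():
        score = cosine(item, centroid)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None


topics = {
    "Topic 1": {"hockey": 1.0, "playoffs": 1.0},
    "Topic 2": {"election": 1.0, "debate": 1.0},
}
known = detect_topic({"election": 1.0, "poll": 1.0}, topics)
novel = detect_topic({"earthquake": 1.0, "rescue": 1.0}, topics)
```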

Topic Tracking
[Diagram: Topic, Sub-Topic]

Topic Tracking
[Figure: timeline of a tracked news topic concerning Iraq and WMD, with dates Nov ’02, Mar ’03, Jan ’04, Sept ’04, Nov ’04]

Summarization
• Text summarization is an active field of research in both IR and Natural Language Processing (NLP)
• NLP is required for high-quality summarization
• IR summarization can provide access to large repositories of data in an efficient way
• IR summarization shares some basic techniques with indexing as both are concerned with identifying what a document is “about”

Summarization
• A summary can consist of:
  – A set of keywords or noun phrases
  – A set of sentences with “important” terms
• A summary can be about:
  – A single document (but not generally)
  – A set of documents
  – A web site

Summarization
• Each document is represented as a vector and tf.idf is used to determine the best terms
• Cluster the documents, create the centroids, and determine the best terms
• Sentences are given weights based on occurrence of terms and the associated tf.idf weights
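
The sentence-weighting step above can be sketched as follows for a single document: compute tf.idf weights for its terms against a small collection, score each sentence by the summed weights of its terms, and keep the top-scoring sentences. The clustering and centroid step is omitted, and the tokenizer, sentence splitter, idf formula, and function names are assumptions chosen for brevity.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_weights(doc, collection):
    """tf.idf weight of each term in doc; idf is log(N / df) over collection.

    Assumes doc is itself a member of collection, so df >= 1 for its terms.
    """
    tf = Counter(tokenize(doc))
    doc_terms = [set(tokenize(d)) for d in collection]
    n = len(collection)
    weights = {}
    for term, freq in tf.items():
        df = sum(1 for terms in doc_terms if term in terms)
        weights[term] = freq * math.log(n / df) if df else 0.0
    return weights


def summarize(doc, collection, k=1):
    """Return the k highest-weighted sentences, in document order."""
    weights = tfidf_weights(doc, collection)
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", doc) if s.strip()]
    ranked = sorted(sentences,
                    key=lambda s: sum(weights.get(t, 0.0) for t in tokenize(s)),
                    reverse=True)
    top = set(ranked[:k])
    return [s for s in sentences if s in top]


collection = [
    "The committee met on Monday. The committee approved the retrieval budget.",
    "The weather was mild on Monday.",
    "The team met the press.",
]
summary = summarize(collection[0], collection, k=1)
```

In this toy collection the second sentence wins because its terms are frequent in the document but rare elsewhere, which is the behavior the tf.idf weighting is meant to capture.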

Metasearch and Distributed Retrieval
• Retrieving and combining information from multiple sources:
  – Data fusion
    • the combination of information from multiple sources that index an effectively common data set
  – Collection fusion or distributed retrieval
    • the combination of information from multiple sources that index effectively disjoint data sets
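
For the data fusion case, a simple and widely used combination rule is CombSUM: normalize each source's scores to a common range, then sum them per document. The sketch below is one possible instance, not the method of the workshop report; the min-max normalization and the example scores are assumptions.

```python
def normalize(scores):
    """Min-max normalize a {doc_id: score} mapping to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 1.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}


def comb_sum(result_lists):
    """Data fusion: sum each document's normalized scores across sources.

    Returns document ids ordered from highest to lowest fused score.
    """
    fused = {}
    for scores in result_lists:
        for doc_id, s in normalize(scores).items():
            fused[doc_id] = fused.get(doc_id, 0.0) + s
    return sorted(fused, key=lambda d: fused[d], reverse=True)


# Two hypothetical engines indexing largely the same documents,
# each scoring on its own scale.
engine_a = {"d1": 12.0, "d2": 8.0, "d3": 4.0}
engine_b = {"d2": 0.9, "d3": 0.5, "d4": 0.1}
ranking = comb_sum([engine_a, engine_b])
```

Documents returned by both engines are rewarded, which suits sources that overlap. With effectively disjoint collections there is little overlap to exploit, so merging depends much more on making scores comparable across sources.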

Issues for Metasearch and DR
• Resource description
• Resource ranking
• Resource selection
• Searching
• Merging of results

Major Issue
• Resource description
• Resource ranking
• Resource selection
• Searching
• Merging of results
• Semantic Interoperability

Summary
• IR is no longer the domain of the “specialist” – everyone gets to play
• Drowning in information
• Next Generation IR tools must be dramatically better than what we have
• IR field must rethink its basic assumptions and evaluation methodologies because the ones that brought us to the level of success we have today will not be sufficient to reach the next level

Long-Term Challenges
• Global Information Access
• Contextual Retrieval