Overview of IR Research
ChengXiang Zhai
Department of Computer Science
University of Illinois, Urbana-Champaign

What is Information Retrieval (IR)?
• Salton’s definition (Salton 68): “information retrieval is a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information”
  – Information: mostly text, but can be anything (e.g., multimedia)
  – Retrieval:
    • Narrow sense: search/querying
    • Broad sense: filtering, classification, summarization, ...
• In more general terms
  – Information access
  – Information seeking
  – Help people manage and make use of all kinds of information

Who is working on IR? (IR and Related Areas)
[Diagram: Information Retrieval at the center, surrounded by related areas. Group labels: Applications, Models, Algorithms, Systems. Area labels: Statistics, Optimization, Machine Learning, Pattern Recognition, Data Mining, Computer Vision, Natural Language Processing, Applications (Web, Bioinformatics…), Human-Computer Interaction, Library & Info Science, Databases, Software engineering, Computer systems]

IR and NLP
• The two fields were closely related from day one, but became somewhat disconnected later, when NLP focused more on cognitive and symbolic approaches while IR focused more on purely statistical approaches
• Most recently the two fields have regained close interactions
  – More complex retrieval tasks (question answering, opinions)
  – More scalable/robust NLP techniques (parsing, extraction)
• IR researchers pioneered statistical approaches to NLP in the 1950s (e.g., H. P. Luhn), which only became popular among NLP researchers in the 1990s

IR and Databases
• “Sibling” fields, but they didn’t get along well with each other
• IR and DB share many common tasks, but the differences in the form of data and the nature of queries were large enough to keep the two fields separate for most of their history
• Major differences in data, users, queries, and what counts as an answer; DB emphasizes efficiency, IR emphasizes effectiveness
• The two fields are now getting closer and closer (DB researchers realized the importance of the 80% of data that is unstructured, and IR researchers realized the importance of semantic search)

IR and Machine Learning
• IR as a subfield of AI (IR = intelligent text access)?
  – AI is too big to have a coherent community (e.g., ML, NLP, and Computer Vision have all “spun off”)
• IR researchers did machine learning as early as the 1960s (Rocchio 1965, relevance feedback), but supervised learning didn’t become popular in IR until the early 1990s, when text categorization started getting a lot of attention
  – Lack of training data for search (no large-scale online systems; users don’t like to make an effort on judgments)
  – Learning-based approaches didn’t prevail for ad hoc retrieval
• Machine learning is now very important for IR
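The relevance feedback idea credited to Rocchio above can be sketched in a few lines. This is a minimal illustration of the commonly taught Rocchio update (move the query toward the average of judged-relevant documents and away from the average of judged-non-relevant ones), not code from the slides. The function name, the dict-based vectors, the example terms, and the weights alpha=1.0, beta=0.75, gamma=0.15 are all assumptions chosen for illustration.

```python
from collections import defaultdict


def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return an updated query vector; all vectors are {term: weight} dicts.

    alpha, beta, gamma are illustrative defaults, not values from the slides.
    """
    updated = defaultdict(float)
    # Keep the original query, scaled by alpha.
    for term, weight in query.items():
        updated[term] += alpha * weight
    # Add the centroid of relevant docs, subtract the centroid of non-relevant docs.
    for docs, coeff in ((relevant, beta), (non_relevant, -gamma)):
        if not docs:
            continue
        for doc in docs:
            for term, weight in doc.items():
                updated[term] += coeff * weight / len(docs)
    # Terms whose weight is not positive are dropped (a common simplification).
    return {term: w for term, w in updated.items() if w > 0}


# Hypothetical toy vectors: one judged-relevant and one judged-non-relevant document.
new_query = rocchio(
    query={"jaguar": 1.0},
    relevant=[{"jaguar": 1.0, "cat": 1.0}],
    non_relevant=[{"jaguar": 1.0, "car": 1.0}],
)
```

In this toy case the updated query gains the term "cat" from the relevant document and drops "car", which only appears in the non-relevant one.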

IR and Library & Information Science
• Inseparable from day one (“Information Science” vs. “Computer Science”)
• Early IR work was mostly done in the context of library and information science (LIS)
• I-School initiative/movement: drop “library” and enlarge the scope to “informatics”, leading to a merger of CS + LIS
• Another example where the boundary between fields is disappearing (setting boundaries is generally harmful for research, but is sometimes needed in practice)

IR and Software Engineering
• Scalability of IR wasn’t a major concern until the Web
  – Data collections were relatively small and didn’t grow quickly until the Web
  – The most effective retrieval models remain simple models based on bag-of-words representation
• However, scalability has always been a core issue in IR, and how to engineer an IR system optimally is extremely important for IR applications
• Nowadays, data-intensive computing is essential for large-scale IR applications

IR and Applications
• Early days: library search, literature
• 1970s: small-scale online search systems
• 1990s: large-scale systems
  – TREC (mostly news data, later other kinds of data)
  – Web search engines
• 2010s: search is everywhere!
• More and more applications in the future

Publications/Societies (broad view)
[Diagram: venues grouped by area around Info Retrieval]
• Info Retrieval: ACM SIGIR, ECIR, CIKM, TREC, TOIS, IRJ, IPM
• Learning/Mining: ICML, NIPS, UAI, AAAI, ACM SIGKDD, ICDM, SDM
• Statistics
• NLP: ACL, HLT, COLING, EMNLP, NAACL
• Applications: ISMB, RECOMB, PSB, WWW, WSDM
• Info. Science: JCDL, JASIS
• Databases: ACM SIGMOD, VLDB, ICDE, EDBT, TODS
• Software/systems: OSDI

Major IR Publication Venues
[Timeline, <1960 to 2010]
• JDoc 1945
• JASIST 1950
• IPM (ISR) 1965
• ACM SIGIR 1978
• ECIR 1978
• ACM TOIS 1983
• TREC 1992
• CIKM 1994
• WWW 1994
• IRJ 1998
• WSDM 2008

IR Research Topics (Broad View)
[Diagram labels: Users, Retrieval Applications, Visualization, Summarization, Filtering, Information Access, Analytics, Applications, Mining, Information Organization, Search, Categorization, Extraction, Clustering, Natural Language Content Analysis, Text Acquisition, Text Mining]

IR Topics (narrow view)
[Diagram of a retrieval system. Components: INDEXING, SEARCHING, INTERFACE, QUERY MODIFICATION, LEARNING. Data flowing between them: docs, Doc Rep, Query, Query Rep, Ranking, results, User, judgments, Feedback]
• 1. Evaluation
• 2. Retrieval (Ranking) Models
• 3. Document representation/structure
• 4. Efficiency & scalability
• 5. Search result summarization/presentation
• 6. User interface (browsing)
• 7. Feedback/Learning
• Topics covered most in this course: 2, 3, 5, 7
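The INDEXING and SEARCHING components above can be illustrated with a toy pipeline. This is a minimal sketch that assumes whitespace tokenization, an in-memory inverted index, and a simple TF-IDF score. The function names, the smoothed IDF formula, and the three example documents are hypothetical choices for illustration and do not come from the slides.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_index(docs):
    """INDEXING: map each term to a postings dict {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index


def search(query, index, num_docs):
    """SEARCHING: rank documents by the summed TF-IDF of matching query terms."""
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue  # query term does not occur in any document
        # Assumed IDF variant: log((N + 1) / df), always positive.
        idf = math.log((num_docs + 1) / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Highest score first; ties broken by doc id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical three-document collection.
docs = {
    "d1": "information retrieval ranks documents",
    "d2": "databases store structured records",
    "d3": "retrieval models rank text documents for retrieval",
}
index = build_index(docs)
results = search("retrieval models", index, len(docs))
```

Here `results` ranks d3 above d1 and omits d2, which shares no terms with the query. Real systems replace each piece (tokenizer, index layout, scoring function) with far more careful designs, which is where topics 2, 3, and 4 come in.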

Major Research Milestones
• Early days (late 1950s to 1960s): foundation and founding of the field
  – Luhn’s work on automatic encoding
  – Cleverdon’s Cranfield evaluation methodology and index experiments
  – Salton’s early work on SMART system and experiments
• 1970s-1980s: a large number of retrieval models
  – Vector space model
  – Probabilistic models
• 1990s: further development of retrieval models and new tasks
  – Language models
  – TREC evaluation
• 2000s-present: more applications, especially Web search and interactions with other fields
  – Web search
  – Learning to rank
  – Scalability (e.g., MapReduce)
[Side annotations on the slide: Indexing: auto vs. manual; Evaluation; System; Indexing + Search; Theory; Large-scale evaluation, beyond ad hoc retrieval; Web search; Machine learning; Scalability]

Frontier Topics in IR: Overview
• Two types of topics
  – 30%: Fundamental challenges: IR models, evaluation, efficiency, user models/studies
  – 70%: Application-driven challenges: Web (1.0, 2.0, 3.0?), Enterprise (text analytics), Scientific Research (bioinformatics, …)
• Methodology
  – 50%: Machine learning (feature set + supervised)
  – 30%: Language models (unigram + unsupervised)
  – 20%: Others (user studies, empirical experiments)
• Trends
  – More interdisciplinary and internationalized
  – More diversification of topics (new applications, new methods)
  – Hard fundamental problems regularly revisited

Topics in SIGIR 2011/2012 CFP
• Document Representation and Content Analysis (e.g., text representation, document structure, linguistic analysis, non-English IR, cross-lingual IR, information extraction, sentiment analysis, clustering, classification, topic models, facets)
• Queries and Query Analysis (e.g., query representation, query intent, query log analysis, question answering, query suggestion, query reformulation)
• Users and Interactive IR (e.g., user models, user studies, user feedback, search interface, summarization, task models, personalized search)
• Retrieval Models and Ranking (e.g., IR theory, language models, probabilistic retrieval models, feature-based models, learning to rank, combining searches, diversity)
• Search Engine Architectures and Scalability (e.g., indexing, compression, MapReduce, distributed IR, P2P IR, mobile devices)
• Filtering and Recommending (e.g., content-based filtering, collaborative filtering, recommender systems, profiles)
• Evaluation (e.g., test collections, effectiveness measures, experimental design)
• Web IR and Social Media Search (e.g., link analysis, query logs, social tagging, social network analysis, advertising and search, blog search, forum search, CQA, adversarial IR, vertical and local search)
• IR and Structured Data (e.g., XML search, ranking in databases, desktop search, entity search)
• Multimedia IR (e.g., image search, video search, speech/audio search, music IR)
• Other Applications (e.g., digital libraries, enterprise search, genomics IR, legal IR, patent search, text reuse)

My View of the Future of IR
[Diagram: the Current Search Engine and three directions of growth beyond it]
• Search, Access, Mining, Task Support (Full-Fledged Text Info. Management)
• Keyword Queries, Search History, Personalization, Complete User Model (User Modeling)
• Bag of words, Entities-Relations, Knowledge Representation (Large-Scale Semantic Analysis)

What You Should Know
• IR is a highly interdisciplinary area interacting with many other areas, especially NLP, ML, DB, HCI, software systems, and Information Science
• Major publication venues, especially ACM SIGIR, ACM CIKM, ACM TOIS, IRJ, IPM, WSDM