Ask Jeeves screen. Copyright © 2004 Sung Hyon Myaeng

Information Retrieval Tutorial

Outline
§ What is Information Retrieval (IR)?
§ Overview of Core IR Technology
§ Overall Directions
§ IR Expanded
  ¨ CLIR/MLIR
  ¨ Classification
  ¨ Topic Detection & Tracking
  ¨ Recommender Systems
  ¨ Summarization
  ¨ Question Answering
  ¨ Information Extraction

What is IR? Traditional IR: Willow System

What is IR? Google Web Search Engine

What is IR? Ask Jeeves

IR & the Rest of the World
[Diagram: Information Retrieval and its neighboring fields]
§ DB
§ Natural Language Processing
§ AI
§ Statistics
§ Human Computer Interaction
§ Linguistics
§ Library & Info Science
§ Computer Science
§ Cognitive Science

Evaluation of IR Systems
§ effectiveness: “relevance”
              Ret    NOT Ret
  Rel          A        B
  NOT Rel      C        D
  ¨ precision: A / (A+C)
  ¨ recall: A / (A+B)
§ efficiency
§ Interactive systems?
§ Others?
[Figure: precision plotted against recall]
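The two effectiveness measures can be computed directly from the contingency table. The following is a minimal sketch; the function name `precision_recall` and the example counts are illustrative assumptions, not values from the slides.

```python
def precision_recall(a: int, b: int, c: int, d: int = 0) -> tuple[float, float]:
    """Compute (precision, recall) from the contingency table.

    a: relevant and retrieved       b: relevant but not retrieved
    c: not relevant but retrieved   d: not relevant, not retrieved (unused here)
    """
    retrieved = a + c
    relevant = a + b
    precision = a / retrieved if retrieved else 0.0
    recall = a / relevant if relevant else 0.0
    return precision, recall


# Hypothetical counts: 40 documents retrieved, 50 relevant in the collection.
p, r = precision_recall(a=30, b=20, c=10, d=940)
print(f"precision={p:.2f} recall={r:.2f}")
```

Note that D does not enter either measure, which is why both stay meaningful even when the collection is very large.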

Overview of Text Retrieval
[Diagram components: Text Processing (Raw text, Text Analysis, Index); User/System Interaction (Info Needs, Analysis of Info Needs, Query); Search Engine (Matching (Inferencing), Retrieval Result); Knowledge Resources & Tools]

Text Processing (1) - Indexing
§ Extraction of index terms and computation of their weights
§ Index terms: represent document content & separate documents
  ¨ “economy” vs “computer” in a news article of Financial Times
§ Morphological Analysis (stemming in English)
  ¨ “벨기에는” (“벨기” + “에는”?; 벨기에 means "Belgium"), “문서내의” (“문서” + “내의”, "within the document")
  ¨ “information”, “informed”, “informs”, “informative”
  ¨ Rule-based vs dictionary-based
§ n-gram
  ¨ “정보검색시스템” ("information retrieval system") => “_정”, “정보”, “보검”, “검색”, … (bi-gram)
  ¨ Surprisingly effective in some languages
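The bi-gram example can be reproduced with a few lines of code. This is a sketch only: the function name `char_ngrams` is hypothetical, and padding only the start of the string with "_" is an assumption based on the “_정” in the slide's example.

```python
def char_ngrams(text: str, n: int = 2, pad: str = "_") -> list[str]:
    """Split text into overlapping character n-grams.

    The text is padded at the start only (an assumption), so the
    first n-gram marks the beginning of the word.
    """
    padded = pad * (n - 1) + text
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


print(char_ngrams("정보검색시스템"))
```

Because no dictionary or morphological analyzer is needed, this kind of indexing is easy to apply to languages where word segmentation is difficult.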

Text Processing (2) – Storing indexing results
[Figure: four documents (1 to 4) containing terms A to G, and the inverted index drawn as a term-by-document matrix in which a mark (v) records each document that contains a term]
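An inverted index of this kind maps each term to the documents that contain it. The sketch below builds one from a toy collection; the document contents are hypothetical and are not a transcription of the figure, and `build_inverted_index` is an illustrative name.

```python
from collections import defaultdict


def build_inverted_index(docs: dict[int, list[str]]) -> dict[str, list[int]]:
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(index.items())}


# Hypothetical toy collection (assumption, for illustration only).
toy_docs = {
    1: ["A", "B", "E"],
    2: ["A", "C", "F"],
    3: ["A", "D", "G"],
    4: ["B", "G"],
}
index = build_inverted_index(toy_docs)
for term, doc_ids in index.items():
    print(term, doc_ids)
```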

Text Processing (3) - Indexing
§ Use of various linguistic resources
  ¨ Dictionaries (noun, Josa (particles), Eomi (endings), bilingual, proper noun, foreign words, …)
    = For extraction and weighting of index terms
  ¨ Thesaurus (e.g. WordNet): a thesaurus is a dictionary of synonyms and antonyms.
    = Controlled vocabulary indexing
    = Matching similar and related words
  ¨ Tagged Corpus: a corpus (rendered in Korean as '말뭉치' or '말모둠') is a collection of written or spoken texts.
§ Most NLP technology is used for term extraction
  ¨ “Bag of words” approach
  ¨ Sense disambiguation?
  ¨ Word order?

Overview of Text Retrieval
[Diagram components: Text Processing (Raw text, Text Analysis, Index); User/System Interaction (Info Needs, Analysis of Info Needs, Query); Search Engine (Matching (Inferencing), Retrieval Result); Knowledge Resources & Tools]

User/System Interaction – Query Models
§ Boolean
  ¨ AND, OR, NOT operators
    = E.g. ((semi-conductor OR chip) AND stock NOT chocolate)
  ¨ adjacency, phrase operators
    = E.g. “stock exchange”, “그리고 아무 말도 하지 않았다” ("and said nothing")
  ¨ Difficult for naïve users
  ¨ visual query interface
§ Word list
  ¨ Vector space model system
    = E.g. (semi-conductor chip stock)
  ¨ Often interpreted as a Boolean query in search engines
    = E.g. (semi-conductor OR chip OR stock)
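Boolean operators map naturally onto set operations over an inverted index. The sketch below evaluates the slide's example query, reading "AND stock NOT chocolate" as "AND stock AND NOT chocolate"; the index contents and the helper name `postings` are hypothetical.

```python
# Hypothetical inverted index: term -> set of document ids.
index = {
    "semi-conductor": {1, 2},
    "chip": {2, 3, 5},
    "stock": {1, 3, 4, 5},
    "chocolate": {5},
}


def postings(term: str) -> set[int]:
    """Return the documents containing term (empty set if unknown)."""
    return index.get(term, set())


# (semi-conductor OR chip) AND stock NOT chocolate
boolean_result = (
    (postings("semi-conductor") | postings("chip")) & postings("stock")
) - postings("chocolate")

# Word list (semi-conductor chip stock) interpreted as an OR query.
word_list_result = postings("semi-conductor") | postings("chip") | postings("stock")

print(sorted(boolean_result))
print(sorted(word_list_result))
```

The OR interpretation of a word list retrieves far more documents than the explicit Boolean query, which is one reason ranking matters for word-list queries.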

User/System Interaction – Query Models
§ “Natural Language” Query
  ¨ E.g.: “I want to get information about ski resorts in Kangwon-do or in the Chungcheong area.”
  ¨ Limitations in NLP
  ¨ Various tricks
§ Query Expansion
  ¨ To resolve mismatches between query terms and index terms for documents
  ¨ A variety of linguistic resources are used (e.g. synonyms, foreign word equivalence classes, bilingual dictionaries)
§ Guide users to follow step-by-step instructions for detailed queries
  ¨ “canned queries” (E.g.: “Ask Jeeves”)
  ¨ query templates
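Query expansion can be sketched as a dictionary lookup that adds related terms to the original query. The synonym table below is a hypothetical stand-in for the linguistic resources the slide mentions, and `expand_query` is an illustrative name.

```python
def expand_query(terms: list[str], synonyms: dict[str, list[str]]) -> list[str]:
    """Append known synonyms after each query term, without duplicates."""
    expanded: list[str] = []
    for term in terms:
        for candidate in [term, *synonyms.get(term, [])]:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


# Hypothetical synonym resource (assumption, for illustration only).
synonyms = {"ski": ["skiing"], "resort": ["lodge"]}
expanded_query = expand_query(["ski", "resort", "Kangwon-do"], synonyms)
print(expanded_query)
```

In practice the added terms are often given lower weights than the original terms so that they help recall without dominating the ranking.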

User/System Interaction – Query Models
§ Relevance feedback
  ¨ “Similar Pages” in Web search engines
  ¨ From a simple query to better queries progressively
    = Limited recall capability of human beings
    = Recognition of a relevant document is much easier.
    = Intended to ease the difficulty of grasping the statistical properties of the entire collection
  ¨ An indirect way of capturing the user needs
§ User profile
  ¨ To reflect user’s interest and orientation in interpreting user queries
  ¨ Need to gather & analyze user log data and learn user models
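One classic way to turn relevance judgments into a better query is the Rocchio update. The slides do not name a specific method, so this is offered only as an illustration; the parameter values `alpha`, `beta`, and `gamma` are commonly used defaults, not values from the source.

```python
def rocchio(
    query: dict[str, float],
    relevant: list[dict[str, float]],
    non_relevant: list[dict[str, float]],
    alpha: float = 1.0,
    beta: float = 0.75,
    gamma: float = 0.15,
) -> dict[str, float]:
    """Move the query toward relevant documents and away from non-relevant ones."""
    terms = set(query)
    for doc in relevant + non_relevant:
        terms |= set(doc)

    new_query = {}
    for term in terms:
        pos = sum(d.get(term, 0.0) for d in relevant) / len(relevant) if relevant else 0.0
        neg = sum(d.get(term, 0.0) for d in non_relevant) / len(non_relevant) if non_relevant else 0.0
        weight = alpha * query.get(term, 0.0) + beta * pos - gamma * neg
        if weight > 0:  # negative weights are dropped
            new_query[term] = weight
    return new_query


# Hypothetical feedback: one document judged relevant, one judged non-relevant.
updated = rocchio(
    query={"ski": 1.0},
    relevant=[{"ski": 1.0, "resort": 1.0}],
    non_relevant=[{"ski": 1.0, "jump": 1.0}],
)
print(updated)
```

The user only has to recognize relevant documents; the new query terms ("resort" here) are supplied by the documents themselves.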

User/System Interaction – Result Presentation
§ Information overload problem – too many retrieved
§ A simple ranked list - title, author, URL, date, …
§ Method 1: Organizing the retrieved documents
  ¨ Result Clustering (E.g. Vivisimo)
  ¨ “Zoom-in” operation (E.g.: Scatter & Gather)
§ Method 2: Visualizing the retrieved documents
  ¨ Overview of a large amount of information
  ¨ Visual expression of document properties
  ¨ E.g. TileBar

Scatter/Gather

TileBar

Result Clustering

Text Retrieval Overview
[Diagram components: Text Processing (Raw text, Text Analysis, Index); User/System Interaction (Info Needs, Analysis of Info Needs, Query); Search Engine (Matching (Inferencing), Retrieval Result); Knowledge Resources & Tools]

Matching & Ranking (1)
§ Inverted File, …
[Figure: a query is looked up in the directory; directory pointers lead into the posting file, which lists documents (Doc #1, Doc #2, Doc #5, …)]

  Directory
  Terms               Wt     Pointers
  가구 (furniture)     0.7    3
  가야 (Gaya)          0.9    …
  신라 (Silla)         0.9    2
  …
  호랑이 (tiger)       0.6    2

The posting file is a per-document inverted file of index terms, made up of the index terms in each document and their position information (sentence number, word number, etc.).

Matching & Ranking (2)
§ Ranking
  ¨ Retrieval Model
    = Boolean (exact) => Fuzzy Set (inexact)
    = Vector Space
    = Probabilistic
    = Inference Net
    = …
  ¨ Weighting Schemes
    = Index terms, query terms
    = Parameters in formulas
    = Document characteristics
    = …

IR Model Example: Vector Space Model
<DOC 1> ... cat ... dog ... mouse ... dog
Q = <cat, mouse, 0>
[Figure: D1 and Q drawn as vectors in a space with axes cat, dog, and mouse]

$D_i = (d_{i1}, d_{i2}, \ldots, d_{in})$
$Q = (q_1, q_2, \ldots, q_n)$
$\mathrm{Similarity}(D_i, Q) = \dfrac{D_i \cdot Q}{|D_i| \, |Q|}$
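The similarity above is the cosine of the angle between the document and query vectors. The sketch below applies it to the slide's example, under the assumption that raw term counts are used as weights (DOC 1 contains cat once, mouse once, dog twice); the axis order (cat, mouse, dog) is also an assumption.

```python
import math


def cosine(d: list[float], q: list[float]) -> float:
    """Cosine similarity: dot product divided by the product of vector lengths."""
    dot = sum(x * y for x, y in zip(d, q))
    norm = math.sqrt(sum(x * x for x in d)) * math.sqrt(sum(y * y for y in q))
    return dot / norm if norm else 0.0


# Axes: (cat, mouse, dog). Raw counts as weights (assumption).
d1 = [1, 1, 2]
q = [1, 1, 0]
print(round(cosine(d1, q), 4))
```

Dividing by the vector lengths keeps long documents from winning simply because they contain more words.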

TF*IDF
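The slide gives only the name of the weighting scheme. As a hedged illustration, the sketch below uses one common variant, weight = tf × log(N / df), where tf is the term count in the document, N the number of documents, and df the number of documents containing the term; other variants exist, and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Weight each term in each document by tf * log(N / df)."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Hypothetical toy collection (assumption, for illustration only).
toy_collection = [["cat", "dog", "mouse", "dog"], ["cat", "cat"], ["mouse"]]
for doc_weights in tf_idf(toy_collection):
    print({t: round(w, 3) for t, w in doc_weights.items()})
```

A term that occurs in every document gets weight zero under this variant, since it does nothing to separate documents.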

Similarity of documents

Similarity of documents for Query

Matching & Ranking (3)
§ Techniques for efficiency
  ¨ New storage structure esp. for new document types
  ¨ Use of accumulators for efficient generation of ranked output
  ¨ Compression/decompression of indexes
§ Technique for Web search engines
  ¨ Use of hyperlinks
    = Inlinks & outlinks
    = Authority vs hub pages
  ¨ In conjunction with Directory Services (e.g. Yahoo)
  ¨ Softbot – storing terabytes of data and efficient crawling
  ¨ ...
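Accumulators let a system score only the documents that share at least one term with the query. The following term-at-a-time sketch keeps one accumulator per candidate document; the postings layout, weights, and the name `rank_with_accumulators` are illustrative assumptions.

```python
import heapq
from collections import defaultdict


def rank_with_accumulators(
    query_weights: dict[str, float],
    postings: dict[str, list[tuple[int, float]]],
    k: int = 10,
) -> list[tuple[int, float]]:
    """Process one query term at a time, adding partial scores per document."""
    accumulators: dict[int, float] = defaultdict(float)
    for term, q_weight in query_weights.items():
        for doc_id, d_weight in postings.get(term, []):
            accumulators[doc_id] += q_weight * d_weight
    # Return the k highest-scoring documents.
    return heapq.nlargest(k, accumulators.items(), key=lambda item: item[1])


# Hypothetical postings: term -> list of (document id, term weight).
toy_postings = {
    "cat": [(1, 0.5), (2, 0.8)],
    "mouse": [(1, 0.25), (3, 0.25)],
}
ranked = rank_with_accumulators({"cat": 1.0, "mouse": 2.0}, toy_postings, k=2)
print(ranked)
```

Documents that contain none of the query terms never receive an accumulator, so the cost depends on the posting lists touched rather than on the size of the collection.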

Web document retrieval – using hyperlinks
[Figure: documents A, B, and C linked to the initial retrieval set for a TERM]
§ Initial Retrieval Set: to be ranked again using the link information
§ C: Candidates for additional retrieval
§ A: Hub document
§ B: Authority document
§ Increase the weight of A, B
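Hub and authority weights are usually computed iteratively, as in Kleinberg's HITS algorithm. The slides do not name the algorithm, so the sketch below is an assumption about what the figure alludes to; the link graph and the function name `hits` are hypothetical.

```python
import math


def hits(links: dict[str, list[str]], iterations: int = 20):
    """Return (hub, authority) scores for a graph given as page -> outlinks."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority: sum of hub scores of the pages linking to the node.
        auth = {n: sum(hub[s] for s, targets in links.items() if n in targets) for n in nodes}
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        # Hub: sum of authority scores of the pages the node links to.
        hub = {n: sum(auth[t] for t in links.get(n, [])) for n in nodes}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return hub, auth


# Hypothetical link graph: A links to B and C, D links to B.
hub_scores, auth_scores = hits({"A": ["B", "C"], "D": ["B"]})
print(max(hub_scores, key=hub_scores.get), max(auth_scores, key=auth_scores.get))
```

In this toy graph A ends up as the strongest hub and B as the strongest authority, matching the roles the slide assigns to documents A and B.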

Characteristics of IR - summary
Unstructured vs Structured: Information Retrieval/Data Retrieval Spectrum

                        Information Retrieval       Data Retrieval
  Models                Probabilistic               Deterministic
  Indexing              Derived from contents       Complete Items
  Matching/Retrieval    Partial or “Best” Match     Exact Match
  Query Types           Natural Language            Structured
  Results Criteria      Relevance                   Any Match
  Results Ordering      Ranked                      Arbitrary