Information Retrieval (IR)

Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).

Information Retrieval Models
• A model is an abstract representation of a process or object
  – Used to study properties, draw conclusions, make predictions
  – The quality of the conclusions depends upon how closely the model represents reality

Exact Match
• Query specifies precise retrieval criteria
• Every document either matches or fails to match the query
• Result is a set of documents
  – Usually in no particular order
  – Often in reverse-chronological order

Best Match
• Query describes retrieval criteria for desired documents
• Every document matches a query to some degree
• Result is a ranked list of documents, “best” first
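The difference between the two kinds of result can be sketched in Python. The three toy documents, the query, and the overlap-count score are assumptions made for illustration only; real best-match models use more refined scoring functions.

```python
# Toy collection (hypothetical): each document is a set of terms.
docs = {
    "d1": {"computer", "information", "retrieval"},
    "d2": {"computer", "retrieval"},
    "d3": {"information"},
}
query = {"information", "retrieval"}

# Exact match: a document either contains every query term or it does not.
exact = {name for name, terms in docs.items() if query <= terms}

# Best match: every document gets a score. Here the score is just the
# number of shared terms, an assumed stand-in for a real ranking function.
scores = {name: len(query & terms) for name, terms in docs.items()}
ranked = sorted(scores, key=lambda name: (-scores[name], name))

print(exact)   # a set of documents, in no particular order
print(ranked)  # a ranked list, best first
```

Exact match returns only the document that contains both terms, while best match keeps every document and orders them by score.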

Information Retrieval Models
• Information retrieval models can be classified into:
  1. Boolean model (Exact Match)
  2. Vector Space model (Best Match)
  3. Probabilistic model (Best Match)

Boolean model
The model can be explained by thinking of a query term as an unambiguous definition of a set of documents.
• Documents are sets of terms.
• Queries are Boolean expressions on terms.
• Queries are index terms linked by AND, OR, or NOT.
• It is an exact match model, which implies that a document is retrieved if and only if it matches the description of the query term set.

Example 1: Boolean model
D1 = “computer information retrieval”
D2 = “computer retrieval”
D3 = “information”
D4 = “computer information”
Q1 = “information AND retrieval”
Q2 = “information BUT NOT computer”
• Answer:
• Q1 = “information AND retrieval” → D1
• Q2 = “information BUT NOT computer” → D3
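Example 1 can be reproduced with Python sets, since each query term defines a set of documents and AND / BUT NOT become set intersection and set difference. This is a minimal sketch; the helper name `postings` is an assumption introduced here, not something from the slides.

```python
# The four documents of Example 1.
docs = {
    "D1": "computer information retrieval",
    "D2": "computer retrieval",
    "D3": "information",
    "D4": "computer information",
}

def postings(term):
    """Return the set of document names whose text contains the term."""
    return {name for name, text in docs.items() if term in text.split()}

all_docs = set(docs)

# Q1 = information AND retrieval  -> set intersection
q1 = postings("information") & postings("retrieval")

# Q2 = information BUT NOT computer -> intersection with the complement
q2 = postings("information") & (all_docs - postings("computer"))

print(q1)
print(q2)
```

Under these definitions `q1` contains only D1 and `q2` contains only D3, which agrees with the answer on the slide.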

Example 2: Boolean model
• Doc 1: “Computers have brought the world to our fingertips. We will try to understand at a basic level the science -- old and new -- underlying this new Computational Universe. Our quest takes us on a broad sweep of scientific knowledge and related technologies… Ultimately, this study makes us look anew at ourselves -- our genome; language; music; "knowledge"; and, above all, the mystery of our intelligence.”
• Doc 2: “An introduction to computer science in the context of scientific, engineering, and commercial applications. The goal of the course is to teach basic principles and practical issues, while at the same time preparing students to use computers effectively for applications in computer science …”

Example 2: Boolean model
• Query: (principles AND knowledge) OR (science AND engineering)

         principles  knowledge  science  engineering  result
Doc 1:       0           1         1          0       FALSE
Doc 2:       1           0         1          1       TRUE

Example 3: Boolean model
• Query: (principles OR knowledge) AND (science OR NOT engineering)

         principles  knowledge  science  engineering  result
Doc 1:       0           1         1          0       TRUE
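The truth values in Examples 2 and 3 can be checked mechanically. In the sketch below the 0/1 rows are read in the order principles, knowledge, science, engineering (the order in which the terms appear in the queries, which is an inference from the slides); the function names are hypothetical.

```python
# Term incidence per document, in the order:
# principles, knowledge, science, engineering (1 = the term occurs).
incidence = {
    "Doc 1": (0, 1, 1, 0),
    "Doc 2": (1, 0, 1, 1),
}

def example2(principles, knowledge, science, engineering):
    # (principles AND knowledge) OR (science AND engineering)
    return bool((principles and knowledge) or (science and engineering))

def example3(principles, knowledge, science, engineering):
    # (principles OR knowledge) AND (science OR NOT engineering)
    return bool((principles or knowledge) and (science or not engineering))

for name, row in incidence.items():
    print(name, example2(*row), example3(*row))
```

For Doc 1 this gives FALSE for the Example 2 query and TRUE for the Example 3 query, as on the slides.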

Example 4: INDEX
• The matrix below represents whether a certain word occurs (1) or does not occur (0) in a given document.

            d1  d2  d3  d4  d5  d6  ...
Antony       1   1   0   0   0   1
Brutus       1   1   0   1   0   0
Caesar       1   1   0   1   1   1
Calpurnia    0   1   0   0   0   0

Hence, the documents that contain “Brutus” and “Caesar” but do not contain “Calpurnia” are:
• 110100 AND 110111 AND 101111 = 100100, in words: d1, d4.
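The bitwise calculation in Example 4 can be sketched with Python integers used as bit vectors. The rows for Brutus and Calpurnia are taken from the bit strings in the slide's own calculation (101111 being the complement of Calpurnia's row); the names `N_DOCS` and `MASK` are assumptions introduced for the sketch.

```python
# One bit vector per term; the leftmost bit is d1, the rightmost is d6.
rows = {
    "Antony":    0b110001,
    "Brutus":    0b110100,
    "Caesar":    0b110111,
    "Calpurnia": 0b010000,
}
N_DOCS = 6
MASK = (1 << N_DOCS) - 1  # keeps NOT within the six document bits

# Brutus AND Caesar AND NOT Calpurnia
not_calpurnia = ~rows["Calpurnia"] & MASK
result = rows["Brutus"] & rows["Caesar"] & not_calpurnia

bits = format(result, f"0{N_DOCS}b")
matching = [f"d{i + 1}" for i, bit in enumerate(bits) if bit == "1"]

print(format(not_calpurnia, f"0{N_DOCS}b"))
print(bits)
print(matching)
```

The mask is needed because Python integers are unbounded, so a plain `~` would produce a negative number rather than a six-bit complement.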

Problems with index
Usually IR is done on a very large document collection (or “corpus”). For instance, assume we have:
• 1 million documents,
• each document is about 1,000 words (2-3 book pages),
• each word is about 6 bytes.
• Then the document collection is about 6 gigabytes (GB) in size,
• with around 500,000 distinct terms.
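The size estimate follows directly from the three assumptions above. A short check in Python, where the variable names are illustrative and 1 GB is taken as 10^9 bytes:

```python
n_docs = 1_000_000       # 1 million documents
words_per_doc = 1_000    # about 1,000 words each
bytes_per_word = 6       # about 6 bytes per word

total_words = n_docs * words_per_doc
collection_bytes = total_words * bytes_per_word
collection_gb = collection_bytes / 10**9  # assuming 1 GB = 10^9 bytes

print(total_words)
print(collection_gb)
```

This gives one billion word occurrences and a collection of about 6 GB.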

Problems with index
• The term-document matrix would be too big: a 500K × 1M matrix has half a trillion 0’s and 1’s.
• It would not fit in a computer’s memory.
• Most of the entries would be 0’s (the data is sparse). It is better to record only the things that do occur, that is, the 1’s. This is the idea behind the “inverted index”.
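Under the figures assumed earlier, at most 1M × 1,000 = 10^9 cells of the matrix can hold a 1, which is no more than 0.2% of its 5 × 10^11 cells. An inverted index stores only those occurrences: for each term, a sorted list of the documents that contain it. The following is a minimal sketch, reusing the four documents of Example 1 with numeric IDs; the function name `intersect` and the merge strategy are illustrative choices, not taken from the slides.

```python
from collections import defaultdict

# Documents of Example 1, keyed by numeric ID.
docs = {
    1: "computer information retrieval",
    2: "computer retrieval",
    3: "information",
    4: "computer information",
}

# Build the index: term -> sorted list of IDs of documents containing it.
index = defaultdict(list)
for doc_id in sorted(docs):
    for term in set(docs[doc_id].split()):
        index[term].append(doc_id)

def intersect(p1, p2):
    """Merge two sorted postings lists, keeping IDs present in both (AND)."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

print(dict(index))
print(intersect(index["information"], index["retrieval"]))
```

Only the 1's of the term-document matrix are stored, so the space needed grows with the number of term occurrences rather than with terms × documents.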