Ambo University Waliso Branch
Course Title: Information Retrieval and Storage
Text Collections and IR
• Information is organized into a large number of documents
– Large collections of documents come from various sources: books, journal articles, conference papers, newspapers, magazines, digital libraries, Web pages, etc.
• Sample statistics of text collections
– Google, www.google.com, offers access to over 3 billion Web documents.
– AltaVista, www.altavista.com, covers over 250 million Web pages.
   • It performs more than 40 million search queries each day in more than 25 languages.
• Can you explore the number of Web pages covered by Yahoo, Excite, and other search engines?
Storage of Text
• Textual documents
– Searchable as text: words are represented as Unicode characters
• Image documents
– Scanned images of text documents, which are not searchable as text: characters, words, etc. are represented as patterns of pixels
• Retrieval from document images: two options
– Recognition-based retrieval: OCR is required to convert document images to ASCII text (which may be error-prone), and then
   • text IR systems are applied to the recognized documents
– Recognition-free retrieval: retrieval from document images without explicit recognition
   • Relevant documents are searched for directly in the image collection
What is Information Retrieval?
• Information retrieval is the process of searching a large unstructured corpus for relevant documents that satisfy the user's information need.
– It is a tool that finds and selects, from a collection of items, a subset that serves the user's purpose
• Much IR research focuses specifically on text retrieval, but there are many other interesting areas:
– Cross-language retrieval, audio (speech and music) retrieval, question answering, image retrieval, video retrieval.
Examples of IR Systems
• Text-based (Lexis-Nexis, Google, FAST): search by keywords; limited search using queries in natural language.
• Multimedia (QBIC, WebSeek, SaFe): search by visual appearance (shapes, colors, …).
• Question answering systems (AskJeeves, Answerbus): search in (restricted) natural language
• Digital and virtual libraries
• Other:
– Cross-language vs. multilingual information retrieval
– Music retrieval
– Medical search engines
Information Retrieval Serves as a Bridge
• An information retrieval system serves as a bridge between the world of authors and the world of readers/users.
– That is, writers present a set of ideas in a document using a set of concepts. Users then query the IR system for relevant documents that satisfy their information need.
[Figure: User, Black box, Documents]
Typical IR System Architecture
[Figure: a document corpus and a query string are the inputs to the IR system, which outputs ranked relevant documents: 1. Doc 1, 2. Doc 2, 3. Doc 3, ...]
IR System vs. Web Search System
[Figure: a Web spider feeds the document corpus; the corpus and a query string are the inputs to the IR system, which outputs ranked relevant documents: 1. Page 1, 2. Page 2, 3. Page 3, ...]
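The role of the Web spider in the figure can be pictured with a small sketch. The slide gives no crawling algorithm, so the breadth-first strategy, the `crawl` function, and the in-memory `WEB` dictionary below are all assumptions made for illustration; a real spider would fetch pages over the network.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
# No network access is performed; this only illustrates the idea.
WEB = {
    "a.example/home": ("welcome page", ["a.example/docs", "b.example/news"]),
    "a.example/docs": ("retrieval documentation", ["a.example/home"]),
    "b.example/news": ("search engine news", ["c.example/missing"]),
}

def crawl(seed, web, max_pages=10):
    """Breadth-first crawl from `seed`; returns {url: text} for fetched pages."""
    corpus = {}
    seen = {seed}
    queue = deque([seed])
    while queue and len(corpus) < max_pages:
        url = queue.popleft()
        if url not in web:          # simulates an unreachable page
            continue
        text, links = web[url]
        corpus[url] = text          # the page joins the document corpus
        for link in links:
            if link not in seen:    # never queue the same URL twice
                seen.add(link)
                queue.append(link)
    return corpus

corpus = crawl("a.example/home", WEB)
print(list(corpus))
```

The resulting `corpus` plays the part of the document corpus that the IR system then indexes and searches.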
The Retrieval Process
[Figure: the retrieval process. Components: User Interface, Text Operations (preprocessing), Query Formulation, Indexing, Searching, Ranking, Text Database, Index file (inverted file). Labeled flows: user need, logical view, user feedback, query, Doc ID, retrieved docs, ranked docs.]
The Retrieval Process
• The text database must be defined before any of the retrieval processes are initiated
• This is usually done by the manager of the database, and includes specifying the following:
– The documents to be used
– The operations to be performed on the text
– The text model to be used (the text structure and which elements can be retrieved)
• The text operations transform the original documents and the information needs, and generate a logical view of them
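As a rough illustration of how text operations produce a logical view, the sketch below lowercases text, splits it into tokens, and drops stopwords. The slide does not say which operations are used, so these particular steps, the `text_operations` function, and the tiny `STOPWORDS` set are assumptions; the tokenizer also handles only ASCII letters and digits.

```python
import re

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"a", "an", "and", "are", "is", "of", "the", "to", "in", "for"}

def text_operations(text):
    """Lowercase, tokenize on ASCII letters/digits, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

doc = "The index is a critical data structure for searching."
query = "Searching the INDEX"

print(text_operations(doc))    # logical view of the document
print(text_operations(query))  # logical view of the information need
```

Applying the same function to documents and to the information need is what lets the two be compared later.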
The Retrieval Process …
• Once the logical view of the documents is defined, the database module builds an index of the text
– An index is a critical data structure
– It allows fast searching over large volumes of data
• Different index structures might be used, but the most popular one is the inverted file (more on this later), as indicated in the slide
• Once the document database is indexed, the retrieval process can be initiated
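A minimal sketch of an inverted file follows: each term maps to the list of documents that contain it, so a lookup avoids scanning every document. The function names, the sample documents, and the simple ASCII tokenizer are assumptions for illustration; production indexes also store positions, frequencies, and compressed postings.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase and split on ASCII letters/digits (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(docs):
    """Map each term to the sorted list of IDs of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

# Hypothetical three-document collection.
docs = {
    1: "Information retrieval finds relevant documents",
    2: "An index allows fast searching",
    3: "The inverted file is a popular index",
}
index = build_inverted_index(docs)
print(index["index"])               # documents containing "index"
print(index.get("retrieval", []))   # documents containing "retrieval"
```

Looking up a term is then a single dictionary access, regardless of how many documents the collection holds.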
The Retrieval Process …
• The user first specifies a user need, which is then parsed and transformed by the same text operations applied to the text
– Next, query operations are applied before the actual query, which provides a system representation of the user need, is generated
• The query is then processed to retrieve documents
– Before the retrieved documents are sent to the user, they are ranked according to their likelihood of relevance
• The user then examines the set of ranked documents in search of useful information. The user has two choices:
– (i) reformulate the query and run it on the entire collection, or (ii) reformulate the query and run it on the result set
• At this point, the user might pinpoint a subset of the documents seen as definitely of interest and initiate a user feedback cycle
– In such a cycle, the system uses the documents selected by the user to change the query formulation.
– Hopefully, this modified query is a better representation of the real user need
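The cycle of querying, ranking, and user feedback can be sketched as follows. The slide does not specify a ranking function or a feedback method, so the term-count scoring in `rank` and the frequency-based expansion in `expand_query` are simple stand-ins chosen for illustration, and the sample documents are hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    """Same text operation for documents and queries (ASCII-only assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def rank(query_terms, docs):
    """Score each document by occurrences of query terms; best first."""
    scores = {}
    for doc_id, text in docs.items():
        counts = Counter(tokenize(text))
        score = sum(counts[t] for t in set(query_terms))
        if score > 0:               # documents with no match are not retrieved
            scores[doc_id] = score
    return sorted(scores, key=lambda d: (-scores[d], d))

def expand_query(query_terms, selected_ids, docs, extra=2):
    """Add the most frequent new terms from the documents the user selected."""
    counts = Counter()
    for doc_id in selected_ids:
        counts.update(t for t in tokenize(docs[doc_id]) if t not in query_terms)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return list(query_terms) + [term for term, _ in ordered[:extra]]

docs = {
    1: "jaguar speed in the wild",
    2: "jaguar car engine car price",
    3: "car engine repair",
}

query = tokenize("Jaguar")
first = rank(query, docs)                    # initial ranked list
expanded = expand_query(query, [2], docs)    # user marks document 2 as relevant
second = rank(expanded, docs)                # re-run on the entire collection
print(first, expanded, second)
```

Here the feedback step moves the query toward the sense the user intended, so documents about cars rise in the second ranking.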
Issues that Arise in IR
• Text representation
– What makes a “good” representation?
– How is a representation generated from text?
– What are the retrievable objects and how are they organized?
• Information need representation
– What is an appropriate query language?
– How can interactive query formulation and refinement be supported?
• Comparing representations
– What is a “good” similarity measure and retrieval model?
– How is uncertainty represented?
• Evaluating effectiveness of retrieval
– What are good metrics?
– What constitutes a good experimental test bed?
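To make two of these questions concrete, the sketch below shows one common similarity measure (cosine similarity over term counts) and two common effectiveness metrics (precision and recall). The slide names none of these, so they are offered only as familiar examples; the function names and sample values are assumptions.

```python
import math
from collections import Counter

def cosine_similarity(a_terms, b_terms):
    """Cosine of the angle between two term-count vectors (0.0 if either is empty)."""
    a, b = Counter(a_terms), Counter(b_terms)
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved; recall = hits / relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query and document term lists.
sim = cosine_similarity(["index", "search"], ["index", "file"])

# Hypothetical run: 4 documents retrieved, 3 relevant in the collection.
p, r = precision_recall([1, 2, 3, 4], [2, 4, 5])
print(sim, p, r)
```

Precision rewards a system for returning few irrelevant documents, while recall rewards it for missing few relevant ones, which is why the two are usually reported together.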
Information Retrieval Research Areas
• Much of IR research focuses specifically on text retrieval, but there are many other interesting areas:
– Cross-language retrieval, which uses a query in one language (say English) and finds documents in other languages (say Amharic and Russian).
– Question-answering IR systems, which retrieve answers from a body of text. For example, the question “Who won the 1997 World Series?” finds a 1997 headline, “World Series: Marlins are champions”.
– Image retrieval, which finds images on a given topic or images that contain a given shape or color.
– Video retrieval, which searches for the video files that the user is looking for.
– Audio retrieval, which deals with searching for speech or music files.
Subareas, Applications, Methods
• Graphical interfaces to support information search
• Information Retrieval & Extraction
• XML retrieval
• Geographic Information Retrieval
• Multimedia information retrieval
• Cross-Language & Multilingual Information Retrieval
• Agent-based (like information filtering, tracking, routing) Information Retrieval
• Adversarial Information Retrieval
• Question answering
• Document Summarization
• Text classification
• Multi-database searching
• Document provenance
• Recommender systems
• Information Retrieval & Machine Learning
• Text Mining & Web Mining
• N-Grams in Information Retrieval
• …