THE BASICS OF INFORMATION RETRIEVAL
CS 4323 / 0910-2 YFA
Available online at http://www.ittelkom.ac.id/staf/yanuar
02 YFA CS 4323 S1/IT/IR/E5/0910, Institut Teknologi Telkom

Searching and Browsing: The Human in the Loop
[Figure: the user searches the index, which returns hits, and browses the repository, which returns objects.]

Examples of Search Systems
• Finding files on a computer system (Spotlight for Macintosh).
• Library catalog for searching bibliographic records about books and other objects (Library of Congress catalog).
• Abstracting and indexing system for finding research information about specific topics (Medline for medical information).
• Web search service for finding web pages (Google).

Evaluation
To place information retrieval on a systematic basis, we need repeatable criteria to evaluate how effectively a system meets the information needs of its users. This proves to be very difficult with a human in the loop, because it is hard to define:
• the task that the human is attempting
• the criteria to measure success.

Information Discovery: Examples and Measures of Success
People have many reasons to look for information:
• Known item: Where will I find the wording of the US Copyright Act?
  Success: A document from a reliable source that has the current wording of the act.
• Fact: What is the capital of Barbados?
  Success: The name of the capital from an up-to-date, reliable source.

Information Discovery: Examples and Measures of Success (continued)
People have many reasons to look for information:
• Introduction or overview: How do diesel engines work?
  Success: A document that is technically correct, of the appropriate length and technical depth for the audience.
• Related information (annotation): Is there a review of this item?
  Success: A review, if one exists, written by a competent author.

Information Discovery: Examples and Measures of Success (continued)
People have many reasons to look for information:
• Comprehensive search: What is known of the effects of global warming on hurricanes?
  Success: A list of all research papers on this topic.
Historically, comprehensive search was the application that motivated information retrieval. It is important in such areas as medicine, law, and academic research.
The standard methods for evaluating search services are appropriate only for comprehensive search.

Indexes
Search systems rarely search document collections directly. Instead, an index is built from the documents in the collection, and the user searches the index.
[Figure: an index is created from the document collection; the user searches the index.]
Documents can be digital (e.g., web pages) or physical (e.g., books).
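
Example (sketch): one common way to build such an index is an inverted index, which maps each word to the documents that contain it. The slides do not name a specific index structure, so this is only an illustration; the function names `build_index` and `search` and the three sample documents are assumptions made for the example.

```python
from collections import defaultdict

def build_index(documents):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            # Strip simple punctuation so "to-night." and "to-night" match.
            index[word.strip(".,;:!?")].add(doc_id)
    return index

def search(index, word):
    """Look the word up in the index instead of scanning the documents."""
    return sorted(index.get(word.lower(), set()))

# Hypothetical three-document collection (lines of the poem used later).
documents = {
    1: "The sea is calm to-night.",
    2: "The tide is full, the moon lies fair",
    3: "Listen! you hear the grating roar",
}
index = build_index(documents)
print(search(index, "sea"))
print(search(index, "the"))
```

The index is built once, and each query is then a dictionary lookup, which is why the user searches the index rather than the collection itself.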

Automatic Indexing
The aim of automatic indexing is to build indexes and retrieve information without human intervention.
When the information that is being searched is text, methods of automatic indexing can be very effective.

Descriptive Metadata
Some methods of information retrieval search descriptive metadata about the objects.
Metadata typically consists of a catalog or indexing record, or an abstract, with one record for each object. The record acts as a surrogate for the object.
• Usually the metadata is stored separately from the objects that it describes, but sometimes it is embedded in the objects.
• Usually the metadata is a set of text fields.
Textual metadata can be used to describe non-textual objects, e.g., software, images, music.

Descriptive Metadata
• Catalog: metadata records that have a consistent structure, organized according to systematic rules. (Example: Library of Congress Catalog)
• Abstract: a free-text record that summarizes a longer document.
• Indexing record: less formal than a catalog record, but more structured than a simple abstract. (Example: Inspec, Medline)

Documents and Surrogates
Document:
  The sea is calm to-night.
  The tide is full, the moon lies fair
  Upon the straits; --on the French coast the light
  Gleams and is gone; the cliffs of England stand,
  Glimmering and vast, out in the tranquil bay.
  Come to the window, sweet is the night-air!
  Only, from the long line of spray
  Where the sea meets the moon-blanch'd land,
  Listen! you hear the grating roar
  Of pebbles which the waves draw back, and fling,
  At their return, up the high strand,
  Begin, and cease, and then again begin,
  With tremulous cadence slow, and bring
  The eternal note of sadness in.
Surrogate (catalog record):
  Author: Matthew Arnold
  Title: Dover Beach
  Genre: Poem
  Date: 1851
Notes:
1. The surrogate is also a document
2. Every word is different!
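
Example (sketch): the second note can be checked by comparing the words of the document with the words of the surrogate. Only the opening lines of the poem are used here, and the helper `words` is a hypothetical name introduced for the illustration.

```python
import re

def words(text):
    """Return the set of lowercase alphanumeric words in a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

# Opening lines of the document (excerpt only).
document = (
    "The sea is calm to-night. "
    "The tide is full, the moon lies fair "
    "Upon the straits"
)
# Field values of the catalog record.
surrogate = {
    "Author": "Matthew Arnold",
    "Title": "Dover Beach",
    "Genre": "Poem",
    "Date": "1851",
}

document_words = words(document)
surrogate_words = words(" ".join(surrogate.values()))

# Words that occur in both the document excerpt and the surrogate.
shared = document_words & surrogate_words
print(shared)
print("arnold" in surrogate_words, "arnold" in document_words)
```

A query on the author's name matches the surrogate but not the full text, while a query on "sea" matches the full text but not the surrogate.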

Library of Congress Catalog Record (part)
CREATED/PUBLISHED: [between 1925 and 1930?]
SUMMARY: U.S. President Calvin Coolidge sits at a desk and signs a photograph, probably in Denver, Colorado. A group of unidentified men look on.
NOTES: Title supplied by cataloger. Source: Morey Engle.
SUBJECTS:
  Coolidge, Calvin,--1872-1933.
  Presidents--United States--1920-1930.
  Autographing--Colorado--Denver--1920-1930.
  Denver (Colo.)--1920-1930.
  Photographic prints.
MEDIUM: 1 photoprint ; 21 x 26 cm. (8 x 10 in.)

Surrogates for Non-Textual Materials
Surrogate: a textual catalog record about a non-textual item (photograph).
Text-based methods of information retrieval can search a surrogate for a photograph.
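
Example (sketch): a text search over a catalog record can find a photograph that contains no text itself. The record below is abridged from the Library of Congress example; the identifier `photo-1` and the function `search_surrogates` are hypothetical, and plain substring matching is assumed for simplicity.

```python
# One abridged catalog record acting as the surrogate for a photograph.
records = [
    {
        "id": "photo-1",  # hypothetical identifier
        "SUMMARY": "U.S. President Calvin Coolidge sits at a desk and "
                   "signs a photograph, probably in Denver, Colorado.",
        "SUBJECTS": "Coolidge, Calvin,--1872-1933. "
                    "Presidents--United States--1920-1930. "
                    "Photographic prints.",
        "MEDIUM": "1 photoprint ; 21 x 26 cm.",
    },
]

def search_surrogates(records, term, field=None):
    """Return ids of records whose text (or one named field) contains the term."""
    term = term.lower()
    hits = []
    for record in records:
        # Search one field if given, otherwise every field except the id.
        fields = [field] if field else [k for k in record if k != "id"]
        if any(term in record.get(f, "").lower() for f in fields):
            hits.append(record["id"])
    return hits

print(search_surrogates(records, "coolidge"))
print(search_surrogates(records, "denver", field="MEDIUM"))
```

Restricting the search to a single field is a simple form of fielded searching: "denver" appears in the summary but not in the medium field.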

IR Groups
• Form groups/teams of 4 or 5 members.
• Choose a group/team leader.
• Each group must have at least one notebook/netbook available for presentations.
• Group membership must stay the same for every stage of the group assignments in this course.

Group Assignment & Presentation I
• Find references on Visual Information Retrieval.
• Prepare presentation slides on the topic.
• The slides should be in bullet-point form, with a minimum of 5 and a maximum of 10 content slides.
• The slides are to be printed and submitted as hard copy, and will then be presented in the class discussion on Monday, 22 February 2010 (regular class).

Structured vs Unstructured Data
• Structured data tends to refer to information in “tables”

  Employee   Manager   Salary
  Smith      Jones     50000
  Chang      Smith     60000
  Ivy        Smith     50000

Typically allows numerical range and exact match (for text) queries, e.g., Salary < 60000 AND Manager = Smith.
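
Example (sketch): the query on the slide combines a numerical range condition with an exact text match. The table is held here as a list of dictionaries, which is an assumption made for the illustration; a real system would typically use a database.

```python
# The table from the slide, one dictionary per row.
employees = [
    {"Employee": "Smith", "Manager": "Jones", "Salary": 50000},
    {"Employee": "Chang", "Manager": "Smith", "Salary": 60000},
    {"Employee": "Ivy", "Manager": "Smith", "Salary": 50000},
]

# Salary < 60000 AND Manager = Smith
matches = [
    row["Employee"]
    for row in employees
    if row["Salary"] < 60000 and row["Manager"] == "Smith"
]
print(matches)
```

Only Ivy satisfies both conditions: Smith reports to Jones, and Chang's salary is not below 60000.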

Unstructured Data
• Typically refers to free text
• Allows
  – Keyword queries including operators
  – More sophisticated “concept” queries, e.g., find all web pages dealing with drug abuse
• Classic model for searching text documents
• Structured data has been the big commercial success [think, Oracle…] but unstructured data is now becoming dominant in a large and increasing range of activities [think, email, the web]

Discussion Class: Document Sets
Each of the three information retrieval systems indexes a set of documents:
(a) What documents does Google index?
(b) What documents does the Library of Congress catalog index?
(c) What documents does Medline index?
In each case, are you searching surrogates or the full text of documents?

Discussion Class: Fielded Searching
For each of the search systems:
(a) Is it full text or fielded searching?
(b) What fields can you search on? How was the fielded information created (by author, by professional cataloger/indexer, by algorithm, etc.)?

Discussion Class: Language Support
Which of the following does each system support?
(a) stop lists (ignore common words)
(b) stemming (compute, computer and computing)
(c) thesaurus (heart attack, cardiac arrest)
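
Example (sketch): the three kinds of language support can be illustrated as steps that normalize a query. The stop list, the suffix list, and the thesaurus entry below are tiny hypothetical tables chosen to fit the examples on the slide; the suffix stripping is deliberately crude and is not a real stemming algorithm.

```python
STOP_WORDS = {"the", "a", "an", "of", "and", "is"}  # tiny illustrative stop list
SUFFIXES = ("ing", "er", "e")                       # crude suffix stripping only
THESAURUS = {"heart attack": "cardiac arrest"}      # phrase -> preferred term

def stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def normalize(query):
    """Apply the thesaurus, drop stop words, then stem the remaining words."""
    query = query.lower()
    for phrase, preferred in THESAURUS.items():
        query = query.replace(phrase, preferred)
    return [stem(word) for word in query.split() if word not in STOP_WORDS]

print([stem(w) for w in ("compute", "computer", "computing")])
print(normalize("the heart attack"))
print(normalize("cardiac arrest"))
```

After normalization, the three forms of "compute" share one stem, and the two thesaurus terms from the slide produce the same query.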

Discussion Class: Search Options
(a) What Boolean operators does each system offer (and, or, not, etc.)?
(b) What special symbols does each system allow in queries as wild cards, etc. (e.g., ? as a truncation symbol)?
(c) What is meant by a "search limit"?
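
Example (sketch): Boolean operators correspond to set operations on the documents listed for each term, and a truncation symbol can be treated as a prefix match. The small index, the document ids, and the function `postings` are hypothetical; real systems differ in which symbols they accept.

```python
# Hypothetical index: each term maps to the ids of documents containing it.
index = {
    "red": {1, 2},
    "wine": {1, 3},
    "health": {2, 3},
    "healthy": {4},
}
all_docs = {1, 2, 3, 4}

def postings(term):
    """Return doc ids for a term; a trailing '?' is treated as truncation."""
    if term.endswith("?"):
        prefix = term[:-1]
        matches = [docs for word, docs in index.items() if word.startswith(prefix)]
        return set().union(*matches)
    return set(index.get(term, set()))

red_and_wine = postings("red") & postings("wine")   # AND: intersection
red_or_wine = postings("red") | postings("wine")    # OR: union
not_wine = all_docs - postings("wine")              # NOT: complement
health_truncated = postings("health?")              # matches health, healthy

print(red_and_wine, red_or_wine, not_wine, health_truncated)
```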

Discussion Class: Browsing
(a) What support does Google give for browsing?
(b) What support does the Library of Congress Catalog give for browsing?
(c) What support does Medline give for browsing?

Discussion Class: Red Wine
What is the medical evidence that red wine is good or bad for your health?
(a) How does Google help a user answer this question?
(b) How does Medline help a user answer this question?
Under what circumstances would you use each system to find information such as this?

Quiz and Game
Puzzle
Answer: Puzzle M
Puzzle
Answer: Puzzle B
Empowering Analysis
YFA, August 2008 (2nd Edition), February 2008
Adapted from cs.cornell.edu