Alexander Gelbukh
Moscow, Russia
Mexico
Computing Research Center (CIC), Mexico

Chung-Ang University, Korea
Electronic Commerce and Internet Application Lab

Special Topics in Computer Science
The Art of Information Retrieval
Alexander Gelbukh
www.Gelbukh.com

Information Retrieval
- In a huge amount of poorly structured information
- find the information that you need
- when you don’t know exactly what you need or can’t explain it
- The Web
- User information need
- Ranking

Importance
- Knowledge: the main treasure of man
- Web: Repository? Cemetery of information!
- Natural language and multimedia information
  - Poorly structured, badly written
- Corporate and organizational document bases
  - Senate speeches: Mexico
  - Medical data collections
  - Corporate memory. Microsoft knowledge base
- Future: data explosion, increasing importance

Perspectives
- Corporations: corporate databases
- Organizations: document bases
- Government
  - European Union multilingual problem
  - The same in Asia
- Academia
  - Lots of open research topics
  - Web topics, Computational Linguistics topics, intelligent technologies, AI

Textbook
http://sunsite.dcc.uchile.cl/irbook/

Contents
1. Introduction
2. Modeling
3. Retrieval Evaluation
4. Query Languages
5. Query Operations
6. Text and Multimedia Languages and Properties
7. Text Operations
8. Indexing and Searching
9. Parallel and Distributed IR
10. User Interfaces and Visualization
11. Multimedia IR: Models and Languages
12. Multimedia IR: Indexing and Searching
13. Searching the Web
14. Libraries and Bibliographical Systems
15. Digital Libraries

Calendar
1. September 18: Chapter 1, Introduction
2. September 25: Chapter 2, Modeling
3. October 2: Chapter 3, Retrieval Evaluation
4. October 9: Chapter 4, Query Languages
5. October 16: Chapter 5, Query Operations
October 23: midterm exam
6. October 30: Chapter 6, Text and Multimedia Languages...
7. November 6: Chapter 7, Text Operations
8. November 13: Chapter 8, Indexing and Searching
9. November 20: Chapter 10, User Interfaces and Visualization
10. November 27: Chapter 13, Searching the Web
11. December 4: Chapter 14, Libraries and Bibliographical Systems
12. December 11: Chapter 15, Digital Libraries
December: final exam

Class structure
Main course: Information Retrieval
- Discussion of previous chapter. Questions
- I briefly present a new chapter
Research seminar: Natural Language Processing
- Discussion of previous paper. Questions
  - Identification of possible research topics
- Presentation of a new paper or current work
- Discussion and questions
- Goal: publications!

Natural Language Processing
Research Seminar

What CL is about
Computers to process natural language text
- “Understand”
- Generate
- Search
- Organize
- Translate
- …
Useful in IR

Methods
- No: text as a stream of letters
  - Brute force statistics
  - Simplified heuristics (e.g., Porter)
- Yes: attention to language rules
  - Linguistically motivated approaches
  - Knowledge-based approaches
  - Corpus-based approaches

What IR is about
- Classical IR: find words? Concepts!
- Question answering
- Summarization
- Clustering
- …
Take language seriously

Text representations for IR
- Represent the retrieval features
  - Strings → stems (lexemes), synsets, phrases
  - Women → woman, lady, female
  - Old men and women → old woman
- Structured representation of text
  - Network of related events and entities
  - Enables logical inference

CL tasks useful in IR
- Morphology (stemming)
- POS / Word sense disambiguation
- Word relatedness
- Anaphora resolution
- Parsing and semantics (phrase search)
- Synonymic rephrasing
- Translation etc…
Each one a whole science in itself

Morphology
- Q: pig   T: piggish
- Simple: stemming
  - piggish → pig-
- Lexeme: set of word forms
  - same stem can give different words
  - pigment → not pig; piny → pine, not pin
- Dictionary/corpus-based methods
  - Learning; dictionary management

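The stemming pitfalls on this slide can be reproduced with a toy suffix stripper. This is a minimal sketch under stated assumptions: the function name `naive_stem` and its three-suffix list are hypothetical and are not Porter's rules. It conflates piggish with pig as intended, but it also maps pigment to pig and piny to pin, which is the kind of error that lexeme-level, dictionary-based methods are meant to avoid.

```python
# Toy suffix stripper; the suffix list is a hypothetical illustration,
# not Porter's actual rule set.
SUFFIXES = ("ment", "ish", "y")


def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem letters."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= min_stem:
            w = w[: -len(suffix)]
            # Undo consonant doubling left by the suffix: "pigg" -> "pig".
            if len(w) >= 2 and w[-1] == w[-2] and w[-1] not in "aeiou":
                w = w[:-1]
            break
    return w


for word in ("piggish", "pigment", "piny"):
    print(word, "->", naive_stem(word))
```

A query for pig would therefore match piggish (wanted) and pigment (unwanted), while a query for pine would miss piny.
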
Part of Speech Disambiguation
- Q: oil well   T: He did very well
- Q: what is an are?   T: They are nice
- Important for English, Chinese. Less important for other language types
- Perhaps not so helpful directly, but necessary for most other tasks
- Usually statistical / heuristic methods

Word Sense Disambiguation
- Q: bank account   T: on the beautiful banks of Han river...
- bill: document, banknote, law, ax, peak, Gates...
- Very frequent, almost any word in text
- Statistical & dictionary methods
- International competitions

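One dictionary method for the bank example is gloss overlap, in the spirit of a simplified Lesk algorithm: pick the sense whose dictionary definition shares the most words with the context. The sketch below is an illustration only. The two glosses, the stopword list, and the names `GLOSSES` and `simplified_lesk` are invented for this example and do not come from a real dictionary or library.

```python
# Hypothetical two-sense inventory for "bank"; the glosses are invented
# for this sketch and are not taken from a real dictionary.
GLOSSES = {
    "bank/finance": "institution that keeps money in an account for customers",
    "bank/river": "sloping land along the side of a river",
}
STOPWORDS = {"a", "an", "the", "of", "on", "in", "for", "that"}


def content_words(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return {w for w in text.lower().split() if w not in STOPWORDS}


def simplified_lesk(context, glosses):
    """Return the sense whose gloss shares the most words with the context."""
    context_set = content_words(context)
    return max(
        glosses,
        key=lambda sense: len(context_set & content_words(glosses[sense])),
    )


print(simplified_lesk("bank account", GLOSSES))
print(simplified_lesk("on the beautiful banks of Han river", GLOSSES))
```

With these toy glosses the query is assigned the financial sense and the text the river sense, so the two would not be matched. Real glosses are short, so overlaps are often empty; that is one reason statistical methods are used alongside dictionaries.
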
Word relatedness
- Q: female   T: woman (women)
  - Synonyms. Subtypes/super-types
  - Dictionaries. WordNet. Similarity. Lesk.
- Q: Korea   T: Seoul
  - Other linguistic relationships (e.g., part)
  - Real-world relationships (facts)
- Q: Clinton   T: Lewinsky
  - Statistical co-occurrence (MI)

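Statistical co-occurrence can be scored with pointwise mutual information (PMI), the per-pair form of the MI mentioned above: PMI(a, b) = log2( P(a, b) / (P(a) P(b)) ), with probabilities estimated as document frequencies. The following is a minimal sketch; the four-document corpus and the function name `pmi` are assumptions made for illustration.

```python
import math

# Hypothetical toy corpus; each string stands for one document.
DOCS = [
    "clinton met lewinsky",
    "clinton lewinsky scandal",
    "clinton visited seoul",
    "korea capital seoul",
]


def pmi(term_a, term_b, documents):
    """Pointwise mutual information of two terms over document co-occurrence."""
    n = len(documents)
    doc_sets = [set(d.lower().split()) for d in documents]
    count_a = sum(term_a in s for s in doc_sets)
    count_b = sum(term_b in s for s in doc_sets)
    count_ab = sum(term_a in s and term_b in s for s in doc_sets)
    if count_ab == 0:
        return float("-inf")  # the terms never co-occur in this corpus
    return math.log2((count_ab / n) / ((count_a / n) * (count_b / n)))


print(pmi("clinton", "lewinsky", DOCS))
print(pmi("korea", "seoul", DOCS))
print(pmi("clinton", "korea", DOCS))
```

A positive score means the two terms appear together more often than chance would predict. On a corpus this small the numbers are not meaningful; the sketch only shows the computation.
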
Anaphora resolution
- Q: Awards of Prof. Han   T: Prof. Han said... He did... IBM awarded him...
  - Frequency
  - Phrases, co-occurrence, summarization, inference, translation
- Heuristic (Mitkov) and knowledge-based methods
- Other types of co-reference

Parsing, semantics
- Q: Awards of Prof. Han
  T1: Prof. Han among many other prizes has several IBM awards
  T2: Mr. Kang has an award Prof. Han does not know of
- Understanding of text
  - Rich structured representation
- Better phrase search; question answering, summarization, ...

Synonymic rephrasing, reasoning
- Q: experienced computer scientists
  T: Prof. Han has been programming for many years and was awarded an IBM award
- Requires good syntactic and semantic analysis
- Knowledge-based methods

Multilingual access
- Q: 요구르트   T: We sell excellent yoghurt. Продаем йогурт. Se vende rico yogur.
  - Search multilingual collections
    - Europe: dozens of official languages of EU
  - If you don’t know how to say it in English
- Dictionaries, bilingual corpora, ...

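The dictionary route can be sketched as query expansion: each query term is replaced by itself plus all of its known translations before matching. Everything below is an assumption for illustration: the one-entry dictionary `TRANSLATIONS` is built from the slide's example, and `expand_query` and `search` are hypothetical helper names. Matching is a crude substring test with no morphology.

```python
# Hypothetical one-entry bilingual dictionary built from the slide's example.
TRANSLATIONS = {
    "요구르트": ["yoghurt", "йогурт", "yogur"],
}

DOCS = [
    "We sell excellent yoghurt.",
    "Продаем йогурт.",
    "Se vende rico yogur.",
    "We sell excellent cheese.",
]


def expand_query(query, translations):
    """Return the query terms plus every known translation of each term."""
    terms = set()
    for term in query.lower().split():
        terms.add(term)
        terms.update(translations.get(term, []))
    return terms


def search(query, documents, translations):
    """Return documents containing any expanded term (crude substring match)."""
    terms = expand_query(query, translations)
    return [doc for doc in documents if any(t in doc.lower() for t in terms)]


for doc in search("요구르트", DOCS, TRANSLATIONS):
    print(doc)
```

The Korean query retrieves the English, Russian, and Spanish documents about yoghurt and skips the one about cheese. Ambiguous translations would call for word sense disambiguation before expansion.
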
Tasks are entangled
- Many CL tasks require other tasks
  - Morphology → syntax → semantics
- Many CL tasks form circles
  - parsing ← WSD ← parsing
  - I see a wild cat with a telescope (tripod?)
- Can be done quick-and-dirty (?)
  - Fighting for the last %s
  - Zipf law: 20% of men drink 80% of beer

Tools and infrastructure
- Analysis tools
  - Tasks, methods
- Dictionaries and grammars
  - Types, structure
  - Automatic acquisition
- Corpora
  - Corpora analysis tools and methods

Possible tasks
- WSD to help IR
- Clustering + summarization in IR results
- Anaphora and coreference resolution to help IR
- Multilingual IR
- Applications to Korean
- ... a lot of others

Reading
- Textbooks
  - Manning & Schütze, Allen, Jurafsky, Hausser, ...
- CICLing proceedings
- Computational Linguistics
- Google, ResearchIndex

Questions
- Who expects to publish?
- Who will make a presentation at the next seminar?

Thank you!
Till September 18