Introduction to Search Engines

  • Slides: 9

Search Engine Overview

Flow: Query → Search → Search Results, with Search drawing on Data.

Data (0): builds the Searchable Index
  1. Document Collection - e.g., spider/crawler
  2. Document Indexing
     - term indexing (tokenizing, stop & stem)
     - term weighting

Search
  (1) Query Indexing
  (2) Document Ranking
  (3) Result Display

User
  • What am I looking for? - Identification of info. need
  • What question do I ask? - Query formulation

Intermediary (Search Engine)
  • What is the searcher looking for? - Discovery of user's info. need
  • How should the question be posed? - Query representation
  • Where is the relevant information? - Query-document matching

Information
  • What data to collect? - Collection development
  • What information to index? - Indexing/Representation
  • How to represent it? - Data structure

Search Engine: Data

• Document Collection
  - Select target data sources - e.g., domain, corpus, WWW
  - Harvest data - e.g., data entry, data import, spider/crawler
• Document Indexing
  - Select indexing sources (index terms) - e.g., metadata, keywords, content
  - Extract indexing terms - e.g., tokenization, stop & stem
  - Assign term weights - e.g., tf-idf, Okapi

"The frequency of word occurrence in an article furnishes a useful measurement of word significance."
  - The words that occur in a document can be used to analyze its content, and how often a word occurs is a measure of its importance as a subject term.

Luhn, H. P. (1958). The automatic creation of literature abstracts. IBM Journal of Research and Development, 2, 159-165.
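The term-weighting step can be sketched in a few lines. The slide names tf-idf but gives no formula, so the variant below (raw term frequency times log(N / df)) is an assumption, and the toy corpus and the function name `tf_idf` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus, assumed to be already tokenized, stopped and stemmed.
docs = {
    "D1": ["information", "retrieval", "seminar"],
    "D2": ["retrieval", "model", "information", "retrieval"],
    "D3": ["information", "model"],
}


def tf_idf(docs):
    """Return {doc_id: {term: weight}} with weight = tf * log(N / df).

    This is one common tf-idf variant, chosen here for illustration only.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))
    weights = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)  # raw term frequency within this document
        weights[doc_id] = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    return weights


weights = tf_idf(docs)
for doc_id, terms in weights.items():
    print(doc_id, {t: round(w, 3) for t, w in terms.items()})
```

With this variant a term that occurs in every document, such as "information" here, gets weight 0, while a term confined to one document gets the largest idf.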

Search Engine: Indexing Process

Pipeline: Documents (Text) → Tokenization → Token Normalization → Token Selection → Term Weighting → SEQUENTIAL INDEX → INVERTED INDEX

Documents (Text)
  D1: Information retrieval seminars
  D2: Retrieval Models and Information Retrieval
  D3: Information Model

Tokens (after tokenization and normalization)
  D1: information, retrieval, seminar(s)
  D2: retrieval, model(s), and, information, retrieval
  D3: information, model

SEQUENTIAL INDEX (selected tokens with term counts)
  D1: information 1, retrieval 1, seminar 1
  D2: information 1, model 1, retrieval 2
  D3: information 1, model 1

INVERTED INDEX
  Index Term           D1  D2  D3
  wd1 (information)     1   1   1
  wd2 (model)           0   1   1
  wd3 (retrieval)       1   2   0
  wd4 (seminar)         1   0   0
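A minimal sketch of this pipeline on the three example documents follows. The stop list and the plural-stripping rule are simplifying assumptions standing in for a real stop-word list and stemmer, and the helper names (`normalize`, `index_terms`) are hypothetical.

```python
from collections import Counter, defaultdict

# Assumption: a one-word stop list is enough for the slide's example.
STOP_WORDS = {"and"}


def normalize(token):
    """Lowercase and strip a trailing plural 's' (a naive stand-in for stemming)."""
    token = token.lower()
    if token.endswith("s") and len(token) > 3:
        token = token[:-1]
    return token


def index_terms(text):
    """Tokenize on whitespace, normalize each token, then drop stop words."""
    tokens = [normalize(t) for t in text.split()]
    return [t for t in tokens if t not in STOP_WORDS]


documents = {
    "D1": "Information retrieval seminars",
    "D2": "Retrieval Models and Information Retrieval",
    "D3": "Information Model",
}

# Sequential index: one entry per document, mapping term -> count.
sequential_index = {d: Counter(index_terms(text)) for d, text in documents.items()}

# Inverted index: one entry per term, mapping doc_id -> count.
inverted_index = defaultdict(dict)
for doc_id, counts in sequential_index.items():
    for term, freq in counts.items():
        inverted_index[term][doc_id] = freq

for term in sorted(inverted_index):
    print(term, inverted_index[term])
```

The inverted index stores only non-zero entries, so the zeros in the term-document table correspond to missing keys.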

Search Engine: Search

• Query Indexing
  - Tokenization
  - Stop & Stem
  - Term Weighting

  Query: What is information retrieval?
  Q: information 1, retrieval 1

• Document Ranking
  - Query-Document matching
  - Document Score computation

  Index Term           D1  D2  D3
  wd1 (information)     1   1   1
  wd2 (model)           0   1   1
  wd3 (retrieval)       1   2   0
  wd4 (seminar)         1   0   0

  Rank  doc. ID  score
  1     D2       3
  2     D1       2
  3     D3       1

• Result Display
  - Content - e.g., title & snippets
  - Layout - e.g., grouped by category
  - Toppings - e.g., related searches
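The scores in the ranking table are consistent with a simple inner product of query and document term weights. The slide does not name the scoring function, so treating it as an inner product is an assumption, and the function name `rank` is hypothetical.

```python
# Inverted index from the slide: term -> {doc_id: term count}.
index = {
    "information": {"D1": 1, "D2": 1, "D3": 1},
    "model": {"D2": 1, "D3": 1},
    "retrieval": {"D1": 1, "D2": 2},
    "seminar": {"D1": 1},
}

# Indexed query "What is information retrieval?" after stop-word removal.
query = {"information": 1, "retrieval": 1}


def rank(query, index):
    """Score each document by the inner product of query and document weights."""
    scores = {}
    for term, q_weight in query.items():
        # Only documents listed under a query term can gain score.
        for doc_id, d_weight in index.get(term, {}).items():
            scores[doc_id] = scores.get(doc_id, 0) + q_weight * d_weight
    # Highest score first; ties broken by document ID.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


ranking = rank(query, index)
for position, (doc_id, score) in enumerate(ranking, start=1):
    print(position, doc_id, score)
```

Because scoring walks only the postings of the query terms, documents that share no term with the query never enter the result list.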

2015

[Figure: search results page screenshot with numbered callouts 1-14]

2015 Result Categories

   1. Encyclopedia
   2. Naver Books
   3. Q&A DB (지식iN)
   4. Magazine
   5. Café
   6. Blog
   7. Book
   8. Map
   9. Website
  10. Advertisement (파워링크)
  11. Image
  12. Webpage
  13. Naver News Library
  14. Video
  15. Naver App. Store
  16. Naver Scholar
  17. Naver Post
  18. Naver Shopping
  19. News
  20. Naver Dictionary

  • Proprietary (Naver-specific) content
  • Dynamic category order
  • Toppings
    - Search by Category
    - Related Searches
    - Popular Searches (by category)

[Figure: result page screenshots with numbered callouts 15-20]
  Query: 정보검색 (Information Retrieval)
  Query: 검색엔진 (Search Engine)

2015 Result Categories

  1. Webpage
  2. Advertisement

  • Webpage-centric content
  • Dynamic category order
  • Toppings
    - Search by Category
    - Related Searches

[Figure: result page screenshots with numbered callouts 1-2]
  Query: Information Retrieval
  Query: Search Engines

Search Engine vs. Database vs. Directories

                   Search Engine               Database                      Directories
Corpus Type        General                     Specific                      General/Specific
Data Collection    Automatic - crawler/spider  Manual - data entry/import    Manual - classification
Data Quality       Not controlled              Controlled                    Controlled
Data Organization  None (bag-of-words)         Structured - Relational       Structured - Hierarchical
Query Input        Text box                    Field-specific - Boolean      Text box
Search Result      Ranked - documents          Not ranked - records          Ranked - categories
Search Index       Document text               Database Tables               Category Tree
e.g.               Google                      Library Search                dmoz.org