Quad Search: A novel metasearch engine (http://cheetah.csd.auth.gr/~lakritid)
Leonidas Akritidis (1), George Voutsakelis (2), Dimitrios Katsaros (1, 2), Panayiotis Bozanis (2)
(1) Data Engineering Lab, Dept. of Informatics, Aristotle Univ., Thessaloniki, Hellas
(2) Computer & Communication Engineering Dept., Univ. of Thessaly, Volos, Hellas
11th Panhellenic Conference of Informatics, Patras, Hellas, 18-20/05/2007

Introduction
Single Search Engines:
• Maintenance of a document database
• Low Web Coverage
• Medium Scalability
• Paid Listings
Metasearch Engines:
• Effortless invocation of multiple search engines
• No document database
• Increased Web Coverage
• Improved retrieval effectiveness

Metasearch Engines
Metasearch engines use the document databases that the component search engines maintain.

Rank Aggregation
What is Rank Aggregation?

Rank Aggregation Methods
• Unweighted Borda Count
• Spearman’s Footrule
• Kendall’s Tau
• Markov Chains
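
As an illustration of the first method in the list, here is a minimal sketch of an unweighted Borda count over top-k lists. It assumes one common variant in which the item at rank r of a list earns k - r + 1 points and an item missing from a list earns nothing; the function name `borda_merge` is hypothetical and is not taken from Quad Search.

```python
from collections import defaultdict


def borda_merge(rankings, k):
    """Merge top-k lists (best result first) with an unweighted Borda count."""
    points = defaultdict(int)
    for top_k in rankings:
        for rank, item in enumerate(top_k[:k], start=1):
            # Rank 1 earns k points, rank k earns 1 point; absent items earn 0.
            points[item] += k - rank + 1
    # Highest total first; ties are broken alphabetically to keep the output stable.
    return sorted(points, key=lambda item: (-points[item], item))


if __name__ == "__main__":
    lists = [["a", "b", "c"], ["b", "a", "d"], ["b", "c", "a"]]
    print(borda_merge(lists, k=3))
```

Every engine contributes with the same weight here, which is what "unweighted" refers to.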

KE Method
Description
Each result is called a candidate.
Each candidate receives a score (weight), computed by a formula over the following quantities:
• r(i): The candidate’s rank in the i-th engine
• n: The number of the candidate’s appearances
• m: The number of the invoked search engines
• k: The length of the top-k list
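
The scoring formula itself is not reproduced in this text. The sketch below assumes the form in which the KE method is stated in the authors' published description, W = (sum of r(i)) / (n^m * (k/10 + 1)^n), where only the engines that returned the candidate contribute to the sum and a lower weight means a better position. Both the formula and the names `ke_scores` and `ke_merge` are assumptions that should be checked against the paper.

```python
from collections import defaultdict


def ke_scores(rankings, k):
    """Return {candidate: KE weight} for a list of top-k lists (best first)."""
    m = len(rankings)                 # number of invoked search engines
    rank_sum = defaultdict(int)       # sum of r(i) over engines listing the candidate
    appearances = defaultdict(int)    # n: number of lists containing the candidate
    for top_k in rankings:
        for rank, item in enumerate(top_k[:k], start=1):
            rank_sum[item] += rank
            appearances[item] += 1
    # Assumed formula: W = sum r(i) / (n^m * (k/10 + 1)^n)
    return {
        item: rank_sum[item]
        / (appearances[item] ** m * (k / 10 + 1) ** appearances[item])
        for item in rank_sum
    }


def ke_merge(rankings, k):
    """Order candidates by ascending KE weight (lower is assumed to be better)."""
    scores = ke_scores(rankings, k)
    return sorted(scores, key=lambda item: (scores[item], item))


if __name__ == "__main__":
    lists = [["a", "b", "c"], ["b", "a", "d"], ["b", "c", "a"]]
    print(ke_merge(lists, k=10))
```

Under this form, the n^m term in the denominator strongly rewards candidates that are returned by several engines.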

Antispam Version of the KE Method
We say that a search engine has been spammed by a page when it ranks the page too highly with respect to the other pages, according to the view of a typical user.
We try to constrain this phenomenon by proposing the Antispam version of the KE Method, which can be described by the following pseudocode:
1. Find which items appear in more than half of the pages (let the number of these items be c)
2. Apply the KE Method to these items
3. Position them in the results list, starting at rank 1
4. Apply the KE Method to the rest of the items
5. Position them in the results list, starting at rank c+1
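
The five steps can be sketched as follows. This is an illustration under two assumptions: "more than half of the pages" is read as "more than half of the engines' top-k lists", and the KE weight uses the form W = (sum of r(i)) / (n^m * (k/10 + 1)^n) with lower weights ranking first. The names `ke_weight` and `antispam_merge` are hypothetical.

```python
def ke_weight(ranks, m, k):
    """Assumed KE weight for one candidate, given the ranks it received."""
    n = len(ranks)
    return sum(ranks) / (n ** m * (k / 10 + 1) ** n)


def antispam_merge(rankings, k):
    """Rank majority items first (steps 1-3), then the rest (steps 4-5)."""
    m = len(rankings)
    ranks = {}
    for top_k in rankings:
        for rank, item in enumerate(top_k[:k], start=1):
            ranks.setdefault(item, []).append(rank)

    def by_weight(items):
        # Ascending weight; ties are broken alphabetically for a stable order.
        return sorted(items, key=lambda it: (ke_weight(ranks[it], m, k), it))

    # Step 1: items returned by more than half of the engines (there are c of them).
    majority = [it for it in ranks if len(ranks[it]) > m / 2]
    rest = [it for it in ranks if len(ranks[it]) <= m / 2]
    # Steps 2-3 fill ranks 1..c; steps 4-5 continue from rank c+1.
    return by_weight(majority) + by_weight(rest)


if __name__ == "__main__":
    engine_1 = ["spam", "p", "q", "r", "good"]
    engine_2 = ["s", "t", "u", "v", "good"]
    print(antispam_merge([engine_1, engine_2], k=10))
```

In the usage example, "spam" is ranked first by a single engine and would receive a lower plain weight (0.5) than "good" (0.625), yet "good" is placed first because it is the only item returned by more than half of the engines.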

Quad Search’s Architecture
Schematic diagram of Quad Search’s Architecture

User Interface
Features
Quad Search’s User Interface is friendly and simple in order to ensure:
• Short download times
• Compatibility with all major browsers
• Convenient usage
For this reason, we avoided using:
• Large graphics files
• JavaScript and AJAX
• Flash Presentations

User Interface (Search Hints)
Search Hints
We developed this part of Quad Search to provide:
• Detailed information about all its features
• Explanations of simple and complex operations
• Many helpful examples

Quad Bot (1)
Description
Quad Bot is responsible for the result retrieval. It consists of the following sub-modules:
• Input Validator: It performs security checks
• Query Dispatcher: It submits the query to the component search engines simultaneously
• Result Collector: It gathers the engines’ responses
• Result Validator: It performs multiple conversions on the collected data
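
The four sub-modules can be pictured as a small pipeline. The sketch below is not Quad Bot's actual code: the engine callables are stand-ins that return lists of URLs, the specific checks and conversions are hypothetical, and a thread pool is only one way to submit the query to all engines at once.

```python
from concurrent.futures import ThreadPoolExecutor


def validate_query(query, max_len=256):
    """Input Validator: reject empty or over-long queries (hypothetical checks)."""
    query = query.strip()
    if not query or len(query) > max_len:
        raise ValueError("query is empty or too long")
    return query


def clean_results(urls):
    """Result Validator: trim whitespace, drop blanks and duplicates, keep order."""
    seen, cleaned = set(), []
    for url in urls:
        url = url.strip()
        if url and url not in seen:
            seen.add(url)
            cleaned.append(url)
    return cleaned


def run_query(query, engines):
    """engines maps an engine name to a callable(query) -> list of URLs."""
    query = validate_query(query)
    with ThreadPoolExecutor(max_workers=max(1, len(engines))) as pool:
        # Query Dispatcher: submit to every engine before waiting on any of them.
        futures = {name: pool.submit(fetch, query) for name, fetch in engines.items()}
        collected = {}
        for name, future in futures.items():
            try:
                # Result Collector: gather each engine's response.
                collected[name] = clean_results(future.result())
            except Exception:
                # A failing engine contributes an empty list instead of aborting.
                collected[name] = []
    return collected


if __name__ == "__main__":
    stub = {"engine_a": lambda q: ["http://x.example/ ", "http://y.example/"]}
    print(run_query("quad search", stub))
```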

Quad Bot (2 - Architecture)
Architecture

Web Search APIs
What is a Web Search API?
API stands for Application Programming Interface. It is a programming tool supplied by the vendor of a large-scale application.
A Web Search API is used to retrieve results from major search engines.
Disadvantages:
• Inaccurate results compared to the “mother” engine
• Queries per Day Limitation
• Registration IDs required
• Queries per Registration ID Limitation
Quad Search does not make use of Search APIs.

Engine Bombing
Definition
Engine Bombing occurs when multiple results from the same domain enter the presented results list.
Many metasearch engines suffer from engine bombing.
Engine Bombing Protection
Quad Search supports a feature to limit the number of results coming from the same domain.
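
A per-domain cap of this kind can be sketched in a few lines. The example assumes that "domain" means the host part of the URL, so `www.a.example` and `a.example` count as different domains, and that the cap is a user-chosen number; `limit_per_domain` and `max_per_domain` are hypothetical names.

```python
from urllib.parse import urlsplit


def limit_per_domain(urls, max_per_domain=2):
    """Keep at most max_per_domain results per host, preserving rank order."""
    kept, counts = [], {}
    for url in urls:
        # urlsplit(...).hostname is lower-cased, or None if the URL has no host.
        host = urlsplit(url).hostname or ""
        if counts.get(host, 0) < max_per_domain:
            kept.append(url)
            counts[host] = counts.get(host, 0) + 1
    return kept


if __name__ == "__main__":
    ranked = [
        "http://a.example/1",
        "http://a.example/2",
        "http://b.example/1",
        "http://a.example/3",
    ]
    print(limit_per_domain(ranked, max_per_domain=2))
```

Because the list is processed in rank order, the results that survive for each domain are its best-ranked ones.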

Results Filtering
Provided Filters
• Antispam Filter: Application of the antispam version of the KE Method
• Ranking Algorithm Selector: Quad Search provides an option to determine how the collected results will be ranked
• Engine Bombing Protection

Advanced Web Search
Advanced Search Filters
• File Type Selector: The user can search for files of a specific format (PDF, DOC, XLS and PPT)
• Language Filter: Quad Search can return documents written in a specified language
• Domain Filter: The user can search a given domain, or exclude a domain from a search
• Date Filter: Return results updated in the past 3, 6, or 12 months

Web Search Options
Quad Search provides the user with the ability to set options that will be used in future searches. Some of these options are:
1. Connection Timeout Feature: how long Quad Search should wait for a search engine to respond
2. Determine the number of candidates to be collected per component engine
3. Determine the number of results to be displayed per result page
4. Determine whether the results will be opened in a new browser window

Results Presentation (1)
Classic View: The results are displayed in the classic way.
Array View: The results are displayed in a ranked array. The user can view the results and their rankings more easily.

Results Presentation (2)
Results Page
The results page is highly customizable.
[Screenshot of the results page]

Scientific Search
General Features
Quad Search is capable of searching for scientists, authors and/or published articles.
Google Scholar provides the required data.
Quad Search collects the data and produces statistics and charts.

H-Index
Definition
The h-index is an index for quantifying the scientific productivity of physicists and other scientists based on their publication record.
A scientist has index h if h of his or her Np papers have at least h citations each, and the other (Np - h) papers have no more than h citations each.
Quad Search computes the h-index when the user performs a search for authors.
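
The definition translates directly into code. A minimal sketch, assuming the citation counts of an author's papers are already available as a list of integers; the function name `h_index` is illustrative and is not taken from Quad Search.

```python
def h_index(citations):
    """Return the largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for position, cites in enumerate(ranked, start=1):
        if cites >= position:
            h = position   # the top `position` papers all have >= position citations
        else:
            break          # later papers have even fewer citations
    return h


if __name__ == "__main__":
    print(h_index([10, 8, 5, 4, 3]))  # four papers have at least 4 citations each
```

For the counts 10, 8, 5, 4, 3 the result is 4: four papers have at least four citations each, and the remaining paper has no more than four.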

Scientific Search Options
The scientific search part of Quad Search offers a variety of options that can be stored and used in future searches.
The user can define:
• The results’ language
• The results’ subject area (biology, chemistry, physics, engineering, medicine etc.)
• The number of results to be displayed per page
• Whether the results will be opened in the current or in a new window

Extra Features - Charts
The user can visually check the number of citations per paper of a specified author. This feature is available for “Author Searches”.

Extra Features - Excluding Papers
When a user performs an “Author Search”, Quad Search retrieves all results from Google Scholar (or its cache).
Some of these articles possibly should not participate in the calculations (e.g. the h-index).
The user can exclude such papers from the calculations by deselecting the appropriate checkbox.

Future Work
Our plans for Quad Search:
• Support for extra ranking algorithms (e.g. Markov chains)
• Geography-aware search for News
• News Search with RSS feeds
• Wide Personalization (users, profiles, topics of interest, stored multimedia and user-defined customization)
• Image and Video searches
• Searches in P2P networks (eDonkey, Gnutella, etc.)
• Torrent Searches

Concluding Remarks
Conclusions
• In this session, we presented a pair of rank aggregation algorithms, the KE Method and its antispam version
• We introduced some new parameters, such as the number of top-k lists in which a page appears and the total number of search engines used
• We also presented a novel metasearch engine, Quad Search
• Quad Search offers a wide variety of new features for web search, such as the ranking algorithm selector and the engine bombing protection
• Quad Search also provides options for searching scientific articles, and computes statistics such as the h-index