Effective Ranking Fusion Methods for Personalized Metasearch Engines

Leonidas Akritidis, Dimitrios Katsaros, Panayiotis Bozanis
Department of Computer and Communication Engineering, University of Thessaly, Volos, Hellas
12th Pan-Hellenic Conference on Informatics, Samos Island, Hellas, 28-30/08/2008

Introduction
Single Search Engines
  o Maintenance of a document database.
  o Maintenance of an index that allows fast searching.
Metasearch Engines
  o Invocation of multiple search engines.
  o No document database. No index.
  o Collection of results from component engines.
  o Use of a Rank Aggregation Algorithm to merge the results.
http://quadsearch.csd.auth.gr

Rank Aggregation
What is Rank Aggregation?
  o The collected data is merged into a final unordered list.
  o A Rank Aggregation procedure proposes a way to sort this list.
Why do we need Rank Aggregation?
  o To provide robust search on the Web.
  o Meta-searching.
  o Spam problem.

Existing Rank Aggregation Methods
The most common rank aggregation methods are:
  o Spearman’s Footrule
  o Kendall’s Tau
  o Borda Count
  o Markov Chains
  o KE Algorithm and its AntiSpam version
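
To make the idea of score-based rank aggregation concrete, here is a minimal Python sketch of the Borda Count, one of the methods listed above. It assumes one common convention (an item at position p of a top-k list earns k - p + 1 points, and an item missing from a list earns nothing from it). The example host names are hypothetical.

```python
from collections import defaultdict


def borda_count(rankings, k):
    """Merge top-k lists with a simple Borda count.

    An item at 1-based position p earns (k - p + 1) points from that list;
    an item absent from a list earns 0 points from it (assumed convention).
    """
    points = defaultdict(int)
    for ranking in rankings:
        for position, item in enumerate(ranking[:k], start=1):
            points[item] += k - position + 1
    # Highest total first; ties are broken alphabetically for determinism.
    return sorted(points, key=lambda item: (-points[item], item))


# Hypothetical top-3 lists from three component engines.
engine_a = ["a.example", "b.example", "c.example"]
engine_b = ["b.example", "a.example", "d.example"]
engine_c = ["b.example", "c.example", "a.example"]

merged = borda_count([engine_a, engine_b, engine_c], k=3)
print(merged)
```

Unlike the KE family discussed in the rest of the slides, this convention ranks higher totals first.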

KE Algorithm
  o The algorithm considers each result as a candidate.
  o Each candidate receives a score (weight), according to a formula (shown as an image on the original slide) over the following quantities:
      r(i): The candidate’s rank in the i-th engine.
      n: The number of the candidate’s appearances.
      m: The number of the invoked search engines.
      k: The length of the top-k list.
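
The KE formula itself is not reproduced in this transcript. The Python sketch below assumes the form W = (sum of r(i)) / (n^m * (k/10 + 1)^n), which uses exactly the four quantities defined on the slide. Treat that form, and the choice to sum only over the engines that returned the candidate, as assumptions to be checked against the paper. The candidate names are hypothetical.

```python
def ke_score(ranks, m, k):
    """Assumed KE score: sum(ranks) / (n**m * (k/10 + 1)**n).

    ranks: 1-based ranks of the candidate in the engines that returned it.
    m:     number of invoked search engines.
    k:     length of each top-k list.
    A lower score is treated as a better result.
    """
    n = len(ranks)
    if n == 0:
        raise ValueError("a candidate must appear in at least one engine")
    return sum(ranks) / (n ** m * (k / 10 + 1) ** n)


# Hypothetical candidates with their ranks in the engines that returned them.
candidates = {
    "a.example": [1, 2, 4],     # returned by three of four engines
    "b.example": [1],           # returned by a single engine
    "c.example": [3, 3, 5, 6],  # returned by all four engines
}

scores = {url: ke_score(r, m=4, k=10) for url, r in candidates.items()}
ranking = sorted(scores, key=scores.get)  # ascending: best (lowest) first
print(ranking)
```

Under this assumed form, a candidate returned by many engines is strongly favoured over one that a single engine ranks first, because n appears in both factors of the denominator.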

Motivations
  o No public information is available on the ranking algorithms used by metasearch engines.
  o Some sporadic works concern the way search engines should construct their index by exploiting the geographical location of Web pages.
  o None of these techniques can be used by metasearch engines.
The existing algorithms:
  o Cannot accept data that varies among users (subjective data) and produce correspondingly different results.
  o Output the same results for the same query, regardless of which user submits it.

Proposed Algorithms
We introduce four new algorithms:
  o GeoKE Algorithm
  o Weighted KE Algorithm
  o URL Aware KE Algorithm
  o Global KE Algorithm
All proposed algorithms:
  o Address the problems mentioned.
  o Are score based (that is, they assign scores to the collected results).
  o Can be fine-tuned by the user, enabling their use in personalized search systems.
  o Have been implemented in a fully functional metasearch engine, QuadSearch, available at http://quadsearch.csd.auth.gr

Geography Aware Ranking (1)
  o Users often seek information that is directly linked to a specific geographical region.
  o For instance, travel-related queries are usually connected to the travel destination.
  o There is strong evidence that a considerable number of search engine queries are geographically oriented.
  o The final two or three characters of the domain name usually indicate the country of origin of the page.
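
As an illustration of the last point, the Python sketch below reads the domain extension from a URL or bare host name. It treats any two-letter alphabetic extension as a country hint, which is a simplifying assumption. The function names are hypothetical and are not taken from QuadSearch.

```python
from urllib.parse import urlsplit


def domain_extension(url):
    """Return the last label of the host name, lowercased ('' if none)."""
    # urlsplit only recognises the host when the string has a '//' part.
    parts = urlsplit(url if "//" in url else "//" + url)
    host = parts.hostname or ""
    return host.rsplit(".", 1)[-1] if "." in host else ""


def country_hint(url):
    """Return a two-letter extension as a country hint, else None."""
    ext = domain_extension(url)
    return ext if len(ext) == 2 and ext.isalpha() else None


print(country_hint("http://www.acropolisofathens.gr/"))  # a .gr host
print(country_hint("en.wikipedia.org"))                  # a generic extension
```

A generic extension such as .org yields no hint, which is why the GeoKE cases later treat such pages separately.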

Geography Aware Ranking (2)
  o Example query: “Hotels in Paris”.
  o There is a significant possibility that pages hosted under the .fr domain extension contain more valuable information than those hosted under other extensions.
  o Assumption: a hypothetical user from the United Kingdom searches for “Hotels in Paris”.
  o Pages under .fr domains are usually written in French.
  o The hypothetical British user must be familiar with French to understand the provided information.
  o A page with a .uk domain extension would probably be the best result. Pages with .us, .au or .ca domain extensions would also be good choices.

GeoKE Algorithm
  o Each candidate receives a score (weight), according to a formula shown as an image on the original slide.
  o In that formula, G is the GeoKE Coefficient. Its value depends on a variety of cases and can be any integer.
  o A result with a lower score will achieve a better ranking than a result with a greater score.

GeoKE Algorithm Implementation (1)
  o The user’s locality is obtained automatically from a database specially designed for matching IP addresses to geographical locations.
  o Another local database table stores the relationships between geographical regions and their respective languages. These relationships are stored as {region - friendly region} pairs.
  o The user’s locality is compared to the locality of the page (revealed by its domain extension). The result of this comparison leads to one of the following four cases.

GeoKE Algorithm Implementation (2)
The algorithm considers the following cases:
  1. The domain extension of the result and the user’s region are the same.
  2. The user can understand the language in which the page is written (example: Brazilian user and Portuguese pages).
  3. The domain extension of the page does not reveal any information about its locality (.com, .net, .edu, .org, etc.).
  4. The page is written in a language that the user cannot understand.
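
The four cases can be sketched in Python as a small classifier over the user’s region and the page’s domain extension. The table contents and function name are hypothetical stand-ins for the {region - friendly region} pairs kept in the QuadSearch database. The slides do not give the values of G for each case, so the sketch stops at the case number.

```python
# Extensions that reveal nothing about locality (illustrative, not exhaustive).
GENERIC_EXTENSIONS = {"com", "net", "edu", "org"}

# Hypothetical {region: friendly regions} pairs; the real table is not shown.
FRIENDLY_REGIONS = {
    "uk": {"us", "au", "ca"},
    "br": {"pt"},
}


def geo_case(user_region, page_extension):
    """Return which of the four cases (1-4) applies to a result."""
    if page_extension == user_region:
        return 1  # same region as the user
    if page_extension in FRIENDLY_REGIONS.get(user_region, set()):
        return 2  # a region whose language the user understands
    if page_extension in GENERIC_EXTENSIONS:
        return 3  # extension carries no locality information
    return 4      # a region whose language the user is assumed not to read


for extension in ("uk", "us", "com", "fr"):
    print(extension, geo_case("uk", extension))
```

For the “Hotels in Paris” example, a British user’s .uk results fall in case 1, .us, .au and .ca results in case 2, and .fr results in case 4.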

Weighted KE Algorithm (1)
  o All methods examined so far treat all component engines equally.
  o It is a common intuition that a single search engine cannot perform equally well for all types of queries.
  o Each engine maintains its own document index.
  o Every time a query is submitted, each engine searches different documents and presents different results than the other engines do.
  o For such cases, an effective scoring algorithm must let the user modify the importance of each component engine.

Weighted KE Algorithm (2)
  o The Weighted KE Algorithm introduces weighting of the component search engines.
  o Each candidate receives a score (weight), according to a formula shown as an image on the original slide.
  o In that formula, e(i) is the Weight Factor of the i-th engine (EWF). The EWF takes integer values.
  o The formula leads to the same ranking as the original version of the algorithm, unless a different EWF is assigned to at least one engine.

Weighted KE Algorithm Implementation
  o In our implementation in QuadSearch, the user is free to assign integer weights to the four component search engines (Google, Yahoo, Live Search and Ask.com).
  o The user selects a weight for each engine. The weights can be any integer between 1 and 10.
  o To increase the importance of the results coming from an engine with a higher weight, we must decrease their score.
  o Thus, the user-selected weights are silently subtracted from eleven (11) and then applied to the Weighted KE Algorithm.
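
The weight inversion can be sketched in Python as follows. The reading of the slide as e(i) = 11 - w(i) and the use of e(i) as a multiplier on each rank are assumptions, since the Weighted KE formula is not reproduced in this transcript. The engine keys and function names are hypothetical.

```python
def engine_weight_factor(user_weight):
    """Map a user weight in 1..10 to an EWF; a higher weight gives a smaller factor."""
    if not 1 <= user_weight <= 10:
        raise ValueError("weight must be an integer between 1 and 10")
    return 11 - user_weight


def weighted_rank_sum(ranks_by_engine, user_weights):
    """Sum of e(i) * r(i) over the engines that returned the candidate.

    This is only an assumed numerator for the weighted score.
    """
    return sum(
        engine_weight_factor(user_weights[engine]) * rank
        for engine, rank in ranks_by_engine.items()
    )


# Hypothetical user preferences: Google matters most, Ask.com least.
weights = {"google": 10, "yahoo": 5, "live": 5, "ask": 1}

from_google = weighted_rank_sum({"google": 3}, weights)  # factor 1
from_ask = weighted_rank_sum({"ask": 3}, weights)        # factor 10
print(from_google, from_ask)
```

Because lower scores rank better, the same rank counts for far more when it comes from the engine the user weighted highest. With equal weights every sum is scaled by the same constant, so the ordering matches the unweighted one.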

Domain Name Analysis (1)
  o Web pages appearing under different subdomains of the same domain name are very common these days.
  o In fact, subdomains are simple folders in a Web server’s public directory and can contain multiple pages with similar or unique informational material.
  o Examples of such pages are the departments of a university, or the personal pages of the academic staff of a faculty.

Domain Name Analysis (2)
  o There is a danger that many pages with similar content, or many pages from one source, will appear in the engine’s result list.
  o This danger is even greater in metasearch engines, where more than one component engine is exploited.
  o The top-10 list that Google returns for the query “Aristotle University of Thessaloniki” contains 10 results, all having the term “auth.gr” in their domain name.
  o Evidently, the limit of no more than two pages with the same domain name in the same result list does not cover subdomains.

URL Aware KE Algorithm (1)
  o Solution: give these pages a lower ranking.
  o The URL Aware KE Algorithm assigns scores to the candidates according to a formula shown as an image on the original slide. In it, D is the Domain Awareness Constant (DAC). A higher value of D will decrease the result’s weight and therefore improve its ranking.
  o The following four distinct cases determine the value that the Domain Awareness Constant receives.

URL Aware KE Algorithm (2)
  1. The result has a domain name that is not repeated more than two times in the result list.
  2. The result has a domain name that is repeated more than two times in the result list, but this result is the best among the others having the same domain name (it has received the best rankings from component engines).
  3. The result has a domain name that is repeated more than two times in the result list and it is not the best among the results with the same domain name. Its domain is central.
  4. The result has a domain name that is repeated more than two times in the result list and it is not the best among the results with the same domain name. The result appears under a subdomain of a central domain.
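
The four cases can be sketched in Python as a classifier over host names that are already ordered from best to worst. The sketch assumes that the “central” domain is the last two labels of the host (a naive rule that mishandles suffixes such as co.uk) and that a host equal to the central domain, with or without “www.”, counts as central. The values of D for each case are not given in the slides, so only the case number is returned. All names are hypothetical.

```python
from collections import Counter


def central_domain(host):
    """Naive central domain: the last two labels of the host name."""
    return ".".join(host.lower().split(".")[-2:])


def classify(hosts_best_first):
    """Return the case number (1-4) for each host, in input order.

    hosts_best_first: host names ordered from best to worst base ranking.
    """
    domains = [central_domain(h) for h in hosts_best_first]
    counts = Counter(domains)
    seen = set()
    cases = []
    for host, domain in zip(hosts_best_first, domains):
        host = host.lower()
        if counts[domain] <= 2:
            case = 1  # domain not repeated more than two times
        elif domain not in seen:
            case = 2  # best-ranked result of a repeated domain
        elif host in (domain, "www." + domain):
            case = 3  # repeated, not the best, on the central domain
        else:
            case = 4  # repeated, not the best, under a subdomain
        seen.add(domain)
        cases.append(case)
    return cases


hosts = ["www.auth.gr", "www.csd.auth.gr", "www.auth.gr",
         "en.wikipedia.org", "www.ee.auth.gr"]
print(classify(hosts))
```

In this hypothetical list, auth.gr occurs four times, so only its best-ranked entry falls in case 2 and the remaining entries are split between cases 3 and 4.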

Global KE Algorithm (1)
  o The Global KE Algorithm derives from the combination of the three scoring methods described above.
  o Each result is assigned a score determined by a formula shown as an image on the original slide.

Global KE Algorithm (2)
The Global KE Algorithm gives the user the ability to define:
  o The importance of each engine for a single query. Not all search engines are treated equally for all types of queries.
  o How the geographic origin of a page and the language in which it is written affect its ranking.
  o How the domain name structure of a Web page affects its ranking.

Time Complexity
  o Result classification performance (seconds) of the proposed algorithms, for various numbers of input results, for the query “Athenian Acropolis”.

                  40 results   80 results   120 results
  Original KE     0.03         0.06         0.11
  Weighted KE     0.03         0.06         0.11
  GeoKE           0.03         0.06         0.13
  URL Aware KE    0.03         0.06         0.11
  Global KE       0.03         0.06         0.14

Time Complexity
  o None of the introduced algorithms significantly harms the performance of the result classification process; their use adds no significant overhead.
  o One might expect the GeoKE Algorithm to result in slower query execution, as a database connection and numerous data transfers are required.
  o In this case, keeping the appropriate data in the server’s main memory eliminates this drawback.

Example Query – Test of Usefulness
  o The GeoKE Algorithm proved very useful for travel-related queries and, in general, for cases where the quality of the presented results is affected by geographical information.
  o We compared the top-10 list returned by the GeoKE Algorithm against the list returned by the original KE Algorithm for the query “Athenian Acropolis”.
  o The query is submitted by a hypothetical user located in Greece.
  o The result sources were the four major commercial search engines: Google, Yahoo, Live Search and Ask.com. All engines were considered to be of equal weight.

Example Query Results

  Rank  Original KE Algorithm      GeoKE Algorithm
  1     witcombe.sbc.edu           witcombe.sbc.edu
  2     www.bluffton.edu           www.bluffton.edu
  3     www.metrum.org             www.acropolisofathens.gr
  4     people.hsc.edu             www.metrum.org
  5     plato-dialogues.org        people.hsc.edu
  6     www.wikitravel.org         plato-dialogues.org
  7     en.wikipedia.org           www.wikitravel.org
  8     www.acropolisofathens.gr   en.wikipedia.org
  9     www.amazon.com             www.amazon.com
  10    www.reconstructions.org    www.culture.gr

Comments
  o The GeoKE Algorithm assigns scores to the collected results with respect to the user’s and the result’s locality.
  o The algorithm’s scoring formula gave the Greek-oriented result www.acropolisofathens.gr a higher rank (rank: 3) than the original KE Algorithm did (rank: 8).
  o Moreover, the top-10 list formed by the GeoKE Algorithm contains one more result of Greek origin, placed in the tenth position (www.culture.gr).

Conclusions
  o We presented three innovative ranking fusion methods and their generalization, the Global KE Algorithm, which derives from their combination.
  o All three methods have been implemented and can be tested in QuadSearch, our experimental metasearch engine, which can be found at http://quadsearch.csd.auth.gr.
  o The open architecture of the proposed algorithms allows user-defined modifications and further fine-tuning.
  o We have tested and demonstrated the usefulness of the algorithms for various types of queries, especially the GeoKE Algorithm.
  o The computational cost that a ranking system incurs from their use is negligible.