EntityRank: Searching Entities Directly and Holistically
Tao Cheng, Xifeng Yan, Kevin Chen-Chuan Chang, CS Department, UIUC
Presented by: Md. Abdus Salam, Ph.D. Student, CSE Department, UTA

Motivating Scenario Customer service phone number of

Search on Amazon?

Search on Google?

Many, Many Similar Cases
  • The email of Luis Gravano?
  • What profs are doing databases at UIUC?
  • The papers and presentations of ICDE 2007?
  • Due date of SIGMOD 2008?
  • Sale price of “Canon PowerShot A400”?
  • “Hamlet” books available at bookstores?
Often, we are looking for data entities (e.g., emails, dates, prices), not pages.

What you search is not what you want.

From Pages to Entities
[Slide figure comparing Traditional Search with Entity Search; labels: Keywords, Entities, Results, Support]

Concretely, what is meant by Entity Search?

Entity Search Problem
  • Input: Keywords and entities (optionally with a pattern), e.g. Amazon Customer Service #phone
  • Output: Ranked entity tuples (the slide shows example scores 0.90, 0.80, 0.60, …)
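To make the input format concrete, here is a minimal Python sketch of how a query such as the one on this slide could be split into keywords and entity types. The `EntityQuery` class and `parse_entity_query` function are hypothetical names introduced for illustration; the slides do not describe the system's actual parser, and optional patterns are not handled here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EntityQuery:
    keywords: List[str]      # plain keyword terms, e.g. "Amazon"
    entity_types: List[str]  # entity types without the leading "#", e.g. "phone"


def parse_entity_query(text: str) -> EntityQuery:
    """Split a query string into keywords and #-prefixed entity types.

    Assumption: tokens are whitespace-separated and an entity type is any
    token that starts with "#". This is an illustrative convention only.
    """
    keywords: List[str] = []
    entity_types: List[str] = []
    for token in text.split():
        if token.startswith("#") and len(token) > 1:
            entity_types.append(token[1:].lower())
        else:
            keywords.append(token)
    return EntityQuery(keywords, entity_types)


query = parse_entity_query("Amazon Customer Service #phone")
print(query)
```

The ranked output would then be a list of entity instances (here, phone numbers), each paired with a score.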

Challenge: How to rank Entities?

Characteristics I: Contextual -Utilize Entities’ Surrounding Context Content

Characteristics II: Uncertain - Extractions are not “perfect”

Characteristics III: Holistic - Much evidence from multiple sources

Characteristics IV: Discriminative - Web Pages are of Varying Quality

Characteristics V: Associative - Tell True Associations from Accidental
Example: Finding Prof. Luis Gravano’s Email
Observation: info@acm.org appears very frequently with the keywords “Luis” and “Gravano”. However, this association is only accidental, as info@acm.org appears on many pages.

EntityRank: The Impression Model
  • Validation Layer: Hypothesis Testing
  • Recognition Layer: Local Assessment
  • Access Layer: Global Aggregation
[Slide figure: a tireless observer passing through the three layers, yielding ranked scores 0.90, 0.80, 0.60, …]

Recognition Layer: Local Assessment
[Slide figure: input (labels L1, L2) and output; checklist of characteristics: Contextual, Uncertain, Holistic, Discriminative, Associative]

Access Layer: Global Aggregation
[Slide figure: input and output; checklist of characteristics: Contextual, Uncertain, Holistic, Discriminative, Associative]
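The slide's formula did not survive extraction, so the following Python sketch shows only one plausible reading of "global aggregation": summing each entity instance's local scores across documents, with each document weighted by its quality (the Discriminative characteristic). The function name `aggregate`, the observation tuples, and the weights are all assumptions made for illustration, not the authors' definition.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def aggregate(
    observations: Iterable[Tuple[str, str, float]],
    doc_weight: Dict[str, float],
) -> Dict[str, float]:
    """Sum weighted local scores per entity instance.

    observations: (doc_id, entity_instance, local_score) triples, where
        local_score stands in for the recognition layer's output.
    doc_weight: hypothetical per-document quality weight (for example a
        normalized link-based score); unknown documents get weight 0.
    """
    totals: Dict[str, float] = defaultdict(float)
    for doc_id, instance, local_score in observations:
        totals[instance] += doc_weight.get(doc_id, 0.0) * local_score
    return dict(totals)


# Illustrative data only; not taken from the experiments.
observations = [
    ("d3", "800-202-7575", 1.0),
    ("d7", "800-202-7575", 1.0),
    ("d7", "800-322-9266", 0.5),
]
doc_weight = {"d3": 0.5, "d7": 0.25}
scores = aggregate(observations, doc_weight)
print(scores)
```

An instance seen on several well-weighted pages accumulates a higher score than one seen once on a low-weight page, which is the holistic and discriminative behavior the characteristics call for.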

Validation Layer: Hypothesis Testing
Input: Collection E over D, and a virtual collection E’ over D’ obtained by randomizing it
[Slide figure: output; checklist of characteristics: Contextual, Uncertain, Holistic, Discriminative, Associative]
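The idea of comparing the observed collection against a randomized one can be sketched with a standard likelihood-ratio (G-test) statistic for a single proportion. This is offered as one common choice of test, assumed here for illustration; the slide text does not show the exact statistic, and the probabilities below are made-up numbers, not measurements.

```python
import math


def g_test_score(p_observed: float, p_random: float) -> float:
    """Likelihood-ratio score for an observed vs. a randomly expected probability.

    p_observed: probability of the keyword-entity association in the real
        collection (E over D on the slide).
    p_random: probability of the same association in the randomized,
        virtual collection (E' over D'). Must lie strictly between 0 and 1.
    The score is 0 when the two agree and grows as they diverge.
    """
    if not 0.0 < p_random < 1.0:
        raise ValueError("p_random must be strictly between 0 and 1")
    if not 0.0 <= p_observed <= 1.0:
        raise ValueError("p_observed must be between 0 and 1")

    def term(a: float, b: float) -> float:
        # Convention: 0 * log(0 / b) is treated as 0.
        return 0.0 if a == 0.0 else a * math.log(a / b)

    return 2.0 * (
        term(p_observed, p_random) + term(1.0 - p_observed, 1.0 - p_random)
    )


# A frequent entity (like info@acm.org) co-occurs often, but nearly as often
# as chance predicts, so its score stays small.
accidental = g_test_score(p_observed=0.011, p_random=0.010)
# A rarer entity that co-occurs far more than chance predicts scores higher.
genuine = g_test_score(p_observed=0.005, p_random=0.0001)
print(accidental, genuine)
```

Note that this statistic is symmetric in the direction of the deviation; a ranking function would also need to check that the observed probability exceeds the random one before treating a high score as evidence of a true association.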

EntityRank: The Scoring Function
[Slide formula annotated with its three components: Validation, Global Aggregation, Local Recognition]

Query Processing
[Slide figure; recoverable labels:]
  • Hypothesis Test, Result: 800-202-7575: p4; 800-322-9266: p5
  • Aggregation: 800-202-7575: p2; 800-202-7575: p1; 800-322-9266: p3
  • Sort-merge Join
  • Document postings (document IDs with positions) for the keywords Amazon, Customer, Service
  • Entity postings for #phone: (42, 851-0400, 0.8), (18, 800-202-7575, 1.0), (13, 800-202-7575, 1.0), (78, 800-322-9266, 1.0)
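The sort-merge join step can be sketched as a multi-way intersection of posting lists sorted by document ID. In the Python sketch below, the posting-list contents are illustrative (the slide's table did not survive extraction), and reading each entity posting as (position, instance, confidence) is an assumption based on the tuples shown on the slide.

```python
from typing import Any, List, Sequence, Tuple

Posting = Tuple[int, Any]  # (doc_id, payload); each list is sorted by doc_id


def sort_merge_join(posting_lists: Sequence[List[Posting]]) -> List[Tuple[int, list]]:
    """Return (doc_id, payloads) for documents present in every posting list.

    Assumes each list is sorted by doc_id with one entry per document.
    """
    if not posting_lists:
        return []
    cursors = [0] * len(posting_lists)
    matches: List[Tuple[int, list]] = []
    while all(c < len(pl) for c, pl in zip(cursors, posting_lists)):
        current = [pl[c][0] for c, pl in zip(cursors, posting_lists)]
        target = max(current)
        if all(doc_id == target for doc_id in current):
            payloads = [pl[c][1] for c, pl in zip(cursors, posting_lists)]
            matches.append((target, payloads))
            cursors = [c + 1 for c in cursors]
        else:
            # Advance only the lists that are behind the largest doc_id.
            cursors = [
                c + 1 if doc_id < target else c
                for c, doc_id in zip(cursors, current)
            ]
    return matches


# Illustrative postings: keyword payloads are token positions; entity payloads
# are assumed to be (position, instance, confidence) tuples.
amazon = [(3, [11]), (5, [66]), (7, [8, 24])]
customer = [(3, [12]), (7, [9]), (8, [44])]
service = [(1, [5]), (3, [10]), (7, [7, 33])]
phone = [
    (3, [(13, "800-202-7575", 1.0)]),
    (7, [(18, "800-202-7575", 1.0)]),
    (9, [(78, "800-322-9266", 1.0)]),
]

matches = sort_merge_join([amazon, customer, service, phone])
print([doc_id for doc_id, _ in matches])
```

Each matching document then supplies the keyword positions and entity occurrences that the later stages (local scoring, aggregation, hypothesis test) would consume.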

Experiment Setup
  • Corpus: General crawl of the Web (Aug. 2006), around 2 TB with 93M pages
  • Entities: Phone (8.8M distinct instances), Email (4.6M distinct instances)
  • System: A cluster of 34 machines

Comparing EntityRank to the Following Approaches
[Slide table: rows are the approaches Naïve, Local, Global, Combine, Without, EntityRank; columns are the characteristics Contextual, Uncertain, Holistic, Discriminative, Associative]

Online Demo.

Example Query Results

Conclusions
  • Formulate the entity search problem
  • Study and define the characteristics of entity search
  • Conceptual Impression Model and concrete EntityRank framework for ranking entities
  • An online prototype with a real Web corpus

Thank You! Questions?

Reference
  • T. Cheng, X. Yan, and K. C.-C. Chang. EntityRank: Searching Entities Directly and Holistically. In Proceedings of the 33rd Very Large Data Bases Conference (VLDB 2007), pages 387-398, Vienna, Austria, September 2007.
  • http://www-forward.cs.uiuc.edu/talks/2007/entityrankvldb07-cyc-sep07.ppt