Implicit Queries for Email
Vitor R. Carvalho (joint work with Joshua Goodman, Microsoft Research)

Search + Email
• Email is the number 1 activity on the internet
  - Fast, easy and cheap
• Search is number 2
  - Highly lucrative (billion-dollar market: targeted ads)
• Why not put them together?
  - Make users happy
  - Make more money

Implicit Queries for Email
• Find good search keywords in email messages
  - 1 click (or less) for users to do a search
• Lots of possible user interfaces
  - Add hyperlinks to words in the message
  - List keywords in a sidebar
  - Perform the search automatically and show results (Gmail)
• Closely related to finding keywords for advertising

Main Contributions
1) Extract Keyphrases
  - Similar to information extraction
  - Several features
2) Rank/Display
  - Maxent probability estimates
3) Select/Filter
  - Restrict to the MSN query logs (7.5 million entries)

Email Dataset
• 20 Hotmail volunteers (not MS employees)
• Spam, “subs” and “wanted” folders
• 6 human annotators labeled 1143 messages according to the following instructions:
  “These are mail messages from real Hotmail users. Imagine that you were the recipient of each message. If your email program were to automatically perform a query to a search engine like MSN Search or Google for you, what words or phrases would you want the engine to search for? In some messages, there may be no words worth searching for. In others, there may be several. When possible, the words or phrases should actually occur in the messages you annotate.”

TF-IDF Baseline
• Extract all possible keyphrases from the email (up to 5 tokens)
• Rank keyphrases by TF-IDF score
  - TF = term frequency: number of times the keyphrase occurs in the email message
  - IDF = 1/DF, where DF is the number of documents in the corpus in which the keyphrase occurs
• Evaluation metrics:
  - Top-1: percentage of rank-1 keyphrases that were labeled as relevant
  - Top-10: number of keyphrases in the top-10 rank that were labeled as relevant, normalized by the total number of relevant keyphrases (no message had more than 10 relevant keyphrases)

Example ranking:

  Keyphrase       TF-IDF
  Port Angeles    0.450
  Lake Crescent   0.120
  Atlanta         0.090
  Mt. Baker       0.045
  ...             ...

Results:

  Method    Top-1 (%)   Top-10 (%)
  TF-IDF    4.87        9.86
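A minimal sketch of this baseline, assuming a toy whitespace tokenizer and a precomputed document-frequency dictionary; the names candidate_phrases, tfidf_rank and doc_freq are illustrative, not from the slides:

```python
from collections import Counter

def candidate_phrases(tokens, max_len=5):
    """All contiguous n-grams of up to max_len tokens."""
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def tfidf_rank(email_text, doc_freq):
    """Rank candidate keyphrases by TF * IDF, with IDF = 1/DF as on the slide."""
    tokens = email_text.lower().split()          # toy tokenizer
    tf = Counter(candidate_phrases(tokens))      # term frequency in this message
    scores = {}
    for phrase, freq in tf.items():
        df = doc_freq.get(phrase, 0)             # document frequency in the corpus
        if df > 0:
            scores[phrase] = freq / df           # TF * (1 / DF)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```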

First Improvement: Constrain Results to a Query Log File
• Query log file: the top 7.5 million queries to MSN Search
• Only return keyphrases from an email if they occur in the query log file
  - Faster: only keyphrases in the message that occur in the query log file are processed
  - Creates some errors
  - Removes some errors, such as "occur in the"
  - Works better!

  Method                               Top-1 (%)   Top-10 (%)
  TF-IDF                               4.87        9.86
  TF-IDF with query log restriction    10.86       30.56
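The restriction step itself is a set-membership filter. A sketch assuming the query log has been loaded into a Python set (query_log and restrict_to_query_log are placeholder names; the MSN log itself is not reproduced here):

```python
def restrict_to_query_log(ranked_phrases, query_log):
    """Keep only keyphrases that also occur in the query log.

    ranked_phrases: list of (phrase, score) pairs, e.g. from tfidf_rank above.
    query_log: set of query strings standing in for the MSN Search log.
    """
    return [(p, s) for p, s in ranked_phrases if p in query_log]
```

This is also where junk candidates such as "occur in the" get discarded, since they rarely appear as real user queries.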

Adding More Features
1) Query Log Frequency
  - Frequency and log(frequency) of the keyphrase
2) Capitalization
  - Word capitalized before/after, # capitalized initials in phrase, # capitalized letters in phrase, etc.
3) TF + IDF based features
  - TF and IDF, from body and from subject
4) Phrase Length
  - Number of characters and number of tokens
5) Punctuation and Alphanumeric
  - Punctuation before/after, has no alpha, has numbers only, etc.
6) Email Specific
  - Has FW: in subject, has RE: in subject
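To make these feature groups concrete, here is an illustrative sketch of a per-keyphrase feature vector. The exact feature definitions are not given on the slide, so phrase_features and its arguments are assumptions rather than the paper's implementation:

```python
import math
import string

def phrase_features(phrase, body_tf, body_idf, subject_tf, query_log_freq, subject_line):
    """Illustrative feature vector for one candidate keyphrase (approximation)."""
    tokens = phrase.split()
    return {
        # 1) query log frequency
        "qlog_freq": query_log_freq,
        "qlog_log_freq": math.log(query_log_freq + 1),
        # 2) capitalization
        "num_cap_initials": sum(t[0].isupper() for t in tokens),
        "num_cap_letters": sum(c.isupper() for c in phrase),
        # 3) TF / IDF, from body and from subject
        "body_tf": body_tf,
        "body_idf": body_idf,
        "subject_tf": subject_tf,
        # 4) phrase length
        "num_chars": len(phrase),
        "num_tokens": len(tokens),
        # 5) punctuation and alphanumeric
        "has_punct": any(c in string.punctuation for c in phrase),
        "has_no_alpha": not any(c.isalpha() for c in phrase),
        "numbers_only": phrase.replace(" ", "").isdigit(),
        # 6) email specific
        "fw_in_subject": "FW:" in subject_line.upper(),
        "re_in_subject": "RE:" in subject_line.upper(),
    }
```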

Maximum Entropy Learner (a.k.a. Logistic Regression)
• Computes P(y = 1 | x) = 1 / (1 + exp(-w · x))
• y is 1 if the keyphrase is relevant
• x is the feature vector (the features from the previous slide)
• The weight vector w is learned using a type of Generalized Iterative Scaling algorithm (SCGIS)
• Rank and cutoff based on the probability estimate
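The slides fit w with SCGIS; as a readily available stand-in, scikit-learn's LogisticRegression trains the same model family with a different optimizer. A minimal sketch with toy data (in practice X would hold the feature vectors sketched above and y the annotators' relevance labels):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: each row is a keyphrase feature vector, y = 1 if annotators
# marked the keyphrase as relevant.
X = np.array([[3.2, 1.0, 12.0],
              [0.1, 0.0,  4.0],
              [2.7, 2.0,  9.0],
              [0.0, 0.0,  3.0]])
y = np.array([1, 0, 1, 0])

# Maxent / logistic regression: P(y=1|x) = 1 / (1 + exp(-(w.x + b))).
# sklearn's default solver stands in for SCGIS here.
model = LogisticRegression().fit(X, y)
probs = model.predict_proba(X)[:, 1]   # probability estimates used for ranking
print(probs)
```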

Rank and cutoff based on probability

  Keyphrase                     Probability
  Port Angeles                  0.121
  Lake Crescent                 0.105
  Olympic National Park         0.034
  Atlanta                       0.031
  Mt. Baker                     0.022
  Hurricane Ridge               0.012
  Marymere Fall                 0.009
  Beaches on the west coast     0.004

  Cutoff = 10%
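The ranking and display step then reduces to sorting by the estimated probability and dropping everything below the threshold; a small sketch (rank_and_cutoff is an illustrative name):

```python
def rank_and_cutoff(phrase_probs, cutoff=0.10):
    """Sort keyphrases by P(relevant) and keep those at or above the cutoff."""
    ranked = sorted(phrase_probs.items(), key=lambda kv: kv[1], reverse=True)
    return [(phrase, p) for phrase, p in ranked if p >= cutoff]

# With the probabilities on this slide and a 10% cutoff, only
# "Port Angeles" (0.121) and "Lake Crescent" (0.105) would be kept.
```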

Performance Analysis

  Method                                                  Top-1 (%)   Top-10 (%)
  TF-IDF (one single feature, no query log restriction)   4.87        9.86
  TF-IDF (one single feature)                             10.86       30.56
  Baseline: 2 features (TF and IDF)                       11.33       32.03
  Baseline + Query Frequency                              23.13*      41.82*
  Baseline + Phrase Length                                12.81       33.25
  Baseline + Capitalization                               21.43*      44.71*
  Baseline + Punctuation                                  13.47       33.02
  Baseline + Email Specific                               11.34       32.03
  Baseline + Alphanumeric                                 11.66       32.65
  Baseline + All Features                                 33.55*      55.26*

  10-fold cross-validation on the 1143 email messages.
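For reference, a sketch of how the Top-1 and Top-10 scores defined on the TF-IDF baseline slide can be computed for a single message; top1_hit, top10_score, ranked and relevant are assumed names, and averaging over messages and cross-validation folds is omitted:

```python
def top1_hit(ranked, relevant):
    """1 if the rank-1 keyphrase was labeled relevant, else 0."""
    return int(bool(ranked) and ranked[0] in relevant)

def top10_score(ranked, relevant):
    """Relevant keyphrases found in the top 10, normalized by the total
    number of relevant keyphrases for this message."""
    if not relevant:
        return 0.0
    hits = sum(1 for phrase in ranked[:10] if phrase in relevant)
    return hits / len(relevant)
```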

Performance Analysis (figure)

Using Other Learning Algorithms (figure)

Opportunities for Future Work
1. Relax the query log restriction
2. Use real advertisement data
3. Use feedback from users (the user can be annoyed, etc.)
4. Use personalization (age, gender, place, etc.)

Conclusions
• Implicit query task → finding good search keywords
• Use of a large query log from MSN Search
• Maxent to combine features and output probabilities, used for ranking and the display cutoff
• The most meaningful features are associated with query frequency and capitalization
• Results several times better than the baseline TF-IDF (Top-1 and Top-10 scores)

Thank you