BuzzTrack: Topic Detection and Tracking in Email
IUI (Intelligent User Interfaces), January 2007
Gabor Cselle, Google, gabor@google.com
Keno Albrecht, ETH Zurich, kenoa@tik.ee.ethz.ch
Roger Wattenhofer, ETH Zurich, wattenhofer@tik.ee.ethz.ch
Email Overload
• Email clients were not designed to handle the volume and variety of messages users deal with today:
  • Large volumes of email
  • Task management
  • Personal archiving or filing
  • Keeping context
[Whittaker and Sidner, 1996]
Search vs. Inbox Browsing
• Fast full-text search is today's solution to finding past emails.
• But the flat inbox view of newly incoming emails hasn’t changed.
In our work, we focus on the problem of sensibly structuring emails in the inbox.
Today's Email Clients: The Three-Pane View
• No sense of context: unrelated messages are shown together.
• Important emails may drop off the “first screen”.
• “Thread-based” tree views are unsophisticated and may not pull in all relevant messages.
BuzzTrack
Email client extension for Mozilla Thunderbird that displays email grouped by topic.
Related Work
Visualizations: Conversations
• Gmail (Google)
• Common conversation title
• One entry per email, folds out on click
Automatic Foldering
• Using machine learning techniques to automatically move emails into folders upon arrival
• Low accuracy rates [Bekkerman et al., 2005]
• Conceptual problems:
  • Users need to manually create folders and seed them with data.
People-Centered Email Clients
• Bifrost [Bälter and Sidner, 2002]
• ContactMap [Whittaker et al., 2004]
Task-based Email
• Example: TaskMaster [Bellotti et al., 2003]
• Panes: thrasks; thrask contents; item contents (emails, documents, etc.)
BuzzTrack
BuzzTrack
• Mozilla Thunderbird extension that automatically groups related emails into topics.
• Will be distributed through the website: www.buzztrack.net
• Provides a view on the user’s inbox.
What’s a Topic?
• Topics are groups of emails that relate to the same idea, action, event, task, or question.
• Examples:
  • A conversation about buying a digital camera.
  • Referring a candidate for a job.
  • All emails belonging to the same newsgroup.
Clustering Process
• For every new incoming email:
  • Diagram components: Preprocessing; Cluster store; BuzzTrack view in Thunderbird; Label generation
Preprocessing
• Tokenization (remove HTML tags, style sheets, punctuation, and numbers)
• Language detection
• Stemming
• For topic labelling:
  • Identify parts of speech
  • Remember popular original word forms
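The tokenization step above can be sketched in Python as follows. This is a minimal illustration, not the BuzzTrack code: the function name `tokenize` and the regular expressions are assumptions, and language detection, stemming, and part-of-speech tagging are left out because they would need external resources.

```python
import re
from html import unescape

# Assumed patterns (not from the slides): drop <style>/<script> blocks first,
# then any remaining tags, then keep only runs of letters.
STYLE_RE = re.compile(r"<(style|script)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")
WORD_RE = re.compile(r"[^\W\d_]+")  # letters only: no digits, punctuation, or underscores


def tokenize(body: str) -> list[str]:
    """Return lowercase word tokens with HTML, style sheets, punctuation, and numbers removed."""
    text = STYLE_RE.sub(" ", body)   # remove style sheets and scripts, including their contents
    text = TAG_RE.sub(" ", text)     # remove remaining HTML tags
    text = unescape(text)            # turn entities such as &amp; into plain characters
    return [word.lower() for word in WORD_RE.findall(text)]


if __name__ == "__main__":
    print(tokenize("<style>p{color:red}</style><p>Buy 2 cameras, now!</p>"))
```

A stemmer would be applied to each returned token afterwards; the original word forms can be kept alongside the stems for label generation.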
Clustering
• Single-link clustering: newly incoming emails are compared to every email in existing topics:
  • Similarity value > threshold: assigned to topic
  • Similarity value <= threshold: email starts new topic
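The assignment rule above can be sketched as an incremental single-link step. This is an illustrative sketch only: `assign_to_topic` and the toy `jaccard` similarity are hypothetical names, and picking the best-scoring topic when several exceed the threshold is an assumption the slides do not state.

```python
from typing import Callable, List


def jaccard(a: str, b: str) -> float:
    """Toy similarity for the example: word-set overlap between two strings."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def assign_to_topic(
    email: str,
    topics: List[List[str]],
    similarity: Callable[[str, str], float],
    threshold: float,
) -> int:
    """Add `email` to the best matching topic or start a new one; return the topic index."""
    best_idx, best_score = -1, float("-inf")
    for idx, topic in enumerate(topics):
        # Single link: a topic's score is the similarity to its closest member.
        score = max(similarity(email, member) for member in topic)
        if score > best_score:
            best_idx, best_score = idx, score
    if best_idx >= 0 and best_score > threshold:
        topics[best_idx].append(email)   # similarity > threshold: assigned to topic
        return best_idx
    topics.append([email])               # similarity <= threshold: email starts new topic
    return len(topics) - 1


if __name__ == "__main__":
    topics: List[List[str]] = []
    for mail in ["buy digital camera", "digital camera prices", "job candidate referral"]:
        assign_to_topic(mail, topics, jaccard, threshold=0.3)
    print(topics)
```

In practice the `similarity` argument would be the combined decision score rather than a single word-overlap measure.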
Features - 1
• How do we generate similarity values between emails?
• Via a linear combination of several similarity features.
• Examples:
  • Text similarity (TF-IDF values, cosine similarity metric)
  • People similarities (comparing sets of people in the From / To / Cc lines of email headers)
  • Thread membership
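Two of the example features can be sketched as follows. The slides name TF-IDF with cosine similarity but do not give the exact weighting, so the `log(n / df)` inverse document frequency used here is an assumption; measuring people similarity as Jaccard overlap of address sets is likewise an assumption, and all function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List


def tfidf_vectors(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Build one sparse TF-IDF vector per tokenized document (idf = log(n / df), assumed)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequency per term
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({term: tf[term] * math.log(n / df[term]) for term in tf})
    return vectors


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either has zero length."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def people_similarity(a: Iterable[str], b: Iterable[str]) -> float:
    """Assumed measure: Jaccard overlap of the From / To / Cc address sets."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


if __name__ == "__main__":
    docs = [["camera", "lens"], ["camera", "price"], ["job", "offer"]]
    v = tfidf_vectors(docs)
    print(cosine(v[0], v[1]), cosine(v[0], v[2]))
    print(people_similarity({"a@x.org", "b@x.org"}, {"b@x.org", "c@x.org"}))
```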
Features - 2
Other features for deriving similarities:
• Subject similarity
• Sender domain overlaps
• Sender rank and percentage
• % of email from sender that is answered
• Time passed since last email in topic
• People and reference count for email
• Known people and reference %
• Cluster size
• Has attachment
Decision Score
Similarities are combined into a decision score for each email/cluster pair through a linear combination of feature values:
$\mathrm{dec}_{i,j} = w_a \cdot \mathrm{sim}_a(m_i, C_j) + w_b \cdot \mathrm{sim}_b(m_i, C_j) + \dots$
We tested two sets of weights $w_x$, both trained on a development set of emails:
• Empirical
• Linear SVM
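The linear combination can be sketched as a weighted sum over named features. The feature names and weight values below are made up for illustration; the slides do not list the trained weights.

```python
from typing import Dict


def decision_score(similarities: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of feature similarities for one email/cluster pair.

    Raises KeyError if a feature has no weight, so a missing weight is not silently ignored.
    """
    return sum(weights[name] * value for name, value in similarities.items())


if __name__ == "__main__":
    # Hypothetical feature values and weights, for illustration only.
    sims = {"text": 0.5, "people": 1.0, "thread": 0.0}
    weights = {"text": 2.0, "people": 1.0, "thread": 3.0}
    print(decision_score(sims, weights))
```

The resulting score is what gets compared against the clustering threshold.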
Evaluation
• How do we evaluate clustering quality?
• Topic Detection and Tracking (TDT) competitions by NIST, aimed at clustering news articles.
• Corpus:
Clustering Tasks
• The clustering task is split into subtasks:
  • New Topic Detection (NTD): Given a stream of emails, which ones start new topics?
  • Topic Tracking (TT): Given a fixed topic, which newly incoming emails belong to it?
• DET curves plot miss rate vs. false alarm rate across possible thresholds on the decision score.
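The points of a DET curve can be computed by sweeping the threshold over the observed decision scores. This is a sketch under stated assumptions: `det_points` is a hypothetical helper, an item is accepted when its score is strictly greater than the threshold, and the plotting itself (DET curves are usually drawn on normal-deviate axes) is omitted.

```python
from typing import List, Sequence, Tuple


def det_points(scores: Sequence[float], labels: Sequence[bool]) -> List[Tuple[float, float, float]]:
    """Return (threshold, miss rate, false alarm rate) for each distinct score used as threshold.

    labels[i] is True when item i is a true target (for example, really belongs to the topic).
    """
    n_target = sum(1 for y in labels if y)
    n_nontarget = len(labels) - n_target
    if n_target == 0 or n_nontarget == 0:
        raise ValueError("need at least one target and one non-target")
    points = []
    for threshold in sorted(set(scores)):
        # Miss: a true target that is rejected (score <= threshold).
        misses = sum(1 for s, y in zip(scores, labels) if y and s <= threshold)
        # False alarm: a non-target that is accepted (score > threshold).
        false_alarms = sum(1 for s, y in zip(scores, labels) if not y and s > threshold)
        points.append((threshold, misses / n_target, false_alarms / n_nontarget))
    return points


if __name__ == "__main__":
    for point in det_points([0.9, 0.8, 0.4, 0.2], [True, False, True, False]):
        print(point)
```

Raising the threshold trades false alarms for misses, which is the trade-off the curve visualizes.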
Results NTD
• TDT New Topic Detection Task (DET plot; axis arrows labeled “better”)
• Miss: 3%, False alarm: 30%
Results TT
• TDT Topic Tracking Task (DET plot; axis arrows labeled “better”)
• Miss: 8%, False alarm: 2%
Comparison
• Comparable quality to TDT for news articles [NIST 2004]
• News has less metadata; email has worse text quality.
• A wide body of work exists on improving clustering performance on news, but we haven’t tapped into it yet.
BuzzTrack View
• Mozilla Thunderbird plugin that provides a useful view on inbox data “for free”
• Topics contain email from the last 60 days
  • We’re interested in current email only
  • Reduces initial clustering time
• Each email is shown in one topic
Demo 1: BuzzTrack
BuzzTrack Panes
Topic pane:
• Provides additional info
• Starred topics
Email pane:
• Topics sorted by last incoming email
Future Work
• Distribute plugin to Thunderbird users
  • Input on possible UI improvements
  • Input on clustering quality
• Different clustering styles
  • People-based
  • Thread-based
• We hope BuzzTrack will be a valuable tool for real-world users
Questions?
Contact: Gabor Cselle, mail@gaborcselle.com
Website: www.buzztrack.net