Techniques for Collaboration in Text Filtering
Ian Soboroff
Department of Computer Science and Electrical Engineering
University of Maryland, Baltimore County
ian@cs.umbc.edu

Overview
• Text filtering and collaborative filtering
• Finding collaboration among content profiles
• Experimental results
• Ongoing work

Information Filtering
• Given
  • a stream of documents (news articles, movies)
  • a set of users (with stable and specific interests)
• Recommend documents to users who will be interested in them
  • "Tell me when a jazz CD comes out that I'll like."
  • "Tell me when an earthquake is reported."

Content Filtering
• Construct profiles from example documents
  • vector of weights for terms in documents
  • can use known relevant and nonrelevant docs
  • can use external resources such as a home page, job description, or research papers
• Match new documents against content profiles

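A minimal sketch of profile construction and matching, under stated assumptions: the slide does not say how profiles are built, so the Rocchio-style combination (mean relevant vector minus a damped mean nonrelevant vector), the log term-frequency weights without idf, and the names `tf_vector`, `build_profile`, `cosine`, `beta`, and `gamma` are all illustrative choices, not the author's method.

```python
import math
from collections import Counter


def tf_vector(text):
    """Log-scaled term-frequency weights for one document (idf omitted for brevity)."""
    counts = Counter(text.lower().split())
    return {term: 1.0 + math.log(n) for term, n in counts.items()}


def build_profile(relevant, nonrelevant, beta=1.0, gamma=0.5):
    """Assumed Rocchio-style profile: mean relevant minus damped mean nonrelevant."""
    profile = Counter()
    for docs, signed_weight in ((relevant, beta), (nonrelevant, -gamma)):
        for doc in docs:
            for term, w in tf_vector(doc).items():
                profile[term] += signed_weight * w / len(docs)
    return dict(profile)


def cosine(u, v):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical example documents, not taken from the talk.
profile = build_profile(
    relevant=["earthquake reported in turkey", "strong earthquake damages buildings"],
    nonrelevant=["jazz cd released"],
)
score_quake = cosine(profile, tf_vector("earthquake hits coastal town"))
score_jazz = cosine(profile, tf_vector("new jazz cd"))
```

A new document would be delivered to the user when its score exceeds some threshold; how that threshold is chosen is outside what the slide covers.
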
Filtering in a Community
• Many people will be watching the same stream
• Some of them may have overlapping interests
  • earthquakes, Mideast politics, building codes, Turkey
  • Charles Mingus, Duke Ellington, Kenny G
• Want to take advantage of group effort

"Pure" Collaborative Filtering
• Collect users' ratings for documents
  • thumbs up/down, or 1-5 scale
• Compute correlations among users
• Predict ratings for new/unseen items using existing ratings and correlation values

Pure CF Example
Comedies Alice Dramas 5 7 Bob ? 9 7 ? 2 9 Carmen 4 9 7 8 1 8 Doug ? 9

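The correlate-then-predict steps of pure collaborative filtering can be sketched as follows. This assumes Pearson correlation over co-rated items and a mean-offset weighted prediction, which is one common choice; the slides do not say which correlation or prediction rule was intended. The ratings below are hypothetical and are not the values from the example table.

```python
import math


def pearson(a, b):
    """Pearson correlation over the items both users have rated."""
    common = [item for item in a if item in b]
    if len(common) < 2:
        return 0.0
    mean_a = sum(a[i] for i in common) / len(common)
    mean_b = sum(b[i] for i in common) / len(common)
    cov = sum((a[i] - mean_a) * (b[i] - mean_b) for i in common)
    var_a = sum((a[i] - mean_a) ** 2 for i in common)
    var_b = sum((b[i] - mean_b) ** 2 for i in common)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def predict(ratings, user, item):
    """User's mean rating plus correlation-weighted deviations of other raters."""
    target = ratings[user]
    user_mean = sum(target.values()) / len(target)
    num = den = 0.0
    for other, theirs in ratings.items():
        if other == user or item not in theirs:
            continue
        w = pearson(target, theirs)
        other_mean = sum(theirs.values()) / len(theirs)
        num += w * (theirs[item] - other_mean)
        den += abs(w)
    return user_mean + num / den if den else user_mean


# Hypothetical ratings: bob tends to agree with alice, carmen to disagree.
ratings = {
    "alice": {"m1": 5, "m2": 7, "m3": 2},
    "bob": {"m1": 4, "m2": 8, "m3": 1, "m4": 9},
    "carmen": {"m1": 8, "m2": 3, "m3": 9, "m4": 2},
}
predicted = predict(ratings, "alice", "m4")
```

Note that a negatively correlated user still contributes: carmen's low rating of `m4` pushes alice's prediction up, just as bob's high rating does.
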
Combining Content and Collaboration
• Pure collaborative filtering
  • can recommend anything
  • must have ratings to give predictions
  • don't know much about documents or ratings
• Adding content to collaboration
  • content filtering can recommend an unrated document
  • exploit common themes among content profiles

One Approach to CBCF
• Construct content profiles
  • Documents are vectors of weighted features
  • Build profiles from known relevant and nonrelevant documents
• Collaborative step
  • Combine profile vectors into a single matrix
  • Compute latent semantic index of profile collection
  • Route new documents in profiles' "LSI space"

Latent Semantic Indexing
$M_{t \times d} = T_{t \times r} \, \Sigma_{r \times r} \, D^{T}_{r \times d}$, where $M = [w_{td}]$
• Compute singular value decomposition of a content matrix $M$
• $D$, a representation of $M$ in $r$ dimensions
• $T$, a matrix for transforming new documents
• $\Sigma$ gives relative importance of dimensions

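For reference, writing the SVD of the $t \times d$ content matrix as $M = T \Sigma D^{T}$, keeping only the $k$ largest singular values gives a rank-$k$ approximation, and a new document's term vector $q$ can be mapped into the same space. The fold-in formula below is the standard LSI one; whether the talk used exactly this scaling is an assumption.

```latex
M \;\approx\; M_k = T_k \, \Sigma_k \, D_k^{T}, \qquad k \le r,
\qquad\qquad
\hat{q} = \Sigma_k^{-1} \, T_k^{T} \, q
```

Here $\hat{q}$ is a $k$-dimensional vector directly comparable to the rows of $D_k$: folding in a column of $M$ itself returns the corresponding row of $D_k$.
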
Collaborating with LSI
• LSI dimensions are...
  • based on term co-occurrence patterns between documents (profiles)
  • ordered by their prominence in the collection
• LSI space built from profiles
  • highlights common patterns among profiles
  • "noisy" dimensions can be pruned
  • project new documents into a collaborative space for routing

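A small sketch of routing in an LSI space built from profiles, using NumPy's `numpy.linalg.svd`. The function names (`lsi_fit`, `fold_in`, `route`), the toy term-by-profile matrix, the choice of `k = 2`, and cosine scoring against profile coordinates are all assumptions for illustration; the slides do not specify how scores were computed.

```python
import numpy as np


def lsi_fit(profiles, k):
    """Truncated SVD of a terms x profiles matrix; returns (T_k, s_k, D_k)."""
    T, s, Dt = np.linalg.svd(profiles, full_matrices=False)
    return T[:, :k], s[:k], Dt[:k, :].T


def fold_in(doc, T_k, s_k):
    """Project a term vector into the k-dimensional space (assumes s_k > 0)."""
    return (T_k.T @ doc) / s_k


def route(doc, T_k, s_k, D_k):
    """Cosine similarity between the projected document and each profile."""
    q = fold_in(doc, T_k, s_k)
    norms = np.linalg.norm(D_k, axis=1) * np.linalg.norm(q)
    return (D_k @ q) / np.where(norms == 0, 1.0, norms)


# Hypothetical profiles as columns; rows are the terms
# hydrogen, fuel, car, sweatshop, clothing.
profiles = np.array(
    [
        [2, 1, 0, 0],
        [1, 2, 1, 0],
        [0, 1, 2, 0],
        [0, 0, 0, 2],
        [0, 0, 0, 1],
    ],
    dtype=float,
)
T_k, s_k, D_k = lsi_fit(profiles, k=2)

# A document mentioning only "car" shares no term with profile 0
# (hydrogen, fuel), yet scores highly for it in the LSI space.
doc = np.array([0, 0, 1, 0, 0], dtype=float)
scores = route(doc, T_k, s_k, D_k)
```

In this toy case the three fuel-related profiles collapse onto one shared dimension, so the document is routed to all of them, while the unrelated fourth profile scores near zero. This is the kind of cross-profile pattern the pruned space is meant to highlight.
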
Experiments with Cranfield
• Cranfield, a standard (if small) IR collection
  • 1398 documents, 255 scored queries
• Profiles: selected Cranfield queries
  • 26 queries with ≥ 15 relevant documents
  • 70% of profile's relevant docs used in each profile
• Results show improvement from using LSI of profiles
  • compared to using profiles alone
  • compared to using LSI of all of Cranfield

Results: Average Precision

Method                                  k-value   Set 1    Set 2
Content (log-tfidf)                     -         0.2894   0.2705
Content LSI (LSI of all of Cranfield)   25        0.2656   0.1980
                                        50        0.3136   0.2686
                                        100       0.3251   0.3053
                                        200       0.3314   0.3144
                                        500       0.3302   0.3149
Collaborative LSI (LSI of profiles)     8         0.3136   0.2583
                                        15        0.4151   0.3745
                                        18        0.3600   0.3615

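For readers unfamiliar with the metric, here is a minimal sketch of average precision for one profile's ranked output. It assumes the non-interpolated variant, normalized by the total number of relevant documents; the slides do not state which variant was used, and the document IDs are hypothetical.

```python
def average_precision(ranked_ids, relevant_ids):
    """Mean of the precision values at the rank of each relevant document found."""
    relevant = set(relevant_ids)
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    # Relevant documents that are never retrieved contribute zero.
    return precision_sum / len(relevant) if relevant else 0.0


# Hypothetical ranking: relevant documents appear at ranks 1 and 4,
# and one relevant document (d9) is never retrieved.
ap = average_precision(["d3", "d1", "d7", "d2"], {"d3", "d2", "d9"})
```

Figures like those in the table would then be the mean of this value over all profiles in a set.
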
Results: Precision-Recall

Experiments with TREC
• TREC-8 routing task
  • Profiles: 50 topics (351-400)
  • Test documents: Financial Times 1993-4
  • Training documents: FT 92, LA Times 89-90, FBIS
• Building profiles
  • short topic description
  • known relevant documents in training set
  • sample of non-relevant documents from training set

Average Precision in TREC
• Average precision...
  • with profiles alone = 0.4464
  • with profile LSI = 0.3971
• LSI shows no improvement over original profiles
• Some topics conceivably have common interests
  • "hydrogen energy"; "hydrogen fuel automobiles"; "hybrid fuel cars"
  • "clothing sweatshops"; "human smuggling"
• But too little training overlap?

Conclusions
• LSI can improve filtering performance
  • but might not, if SVD can't find anything to work with
• LSI of profiles is much cheaper to compute than LSI of a whole collection (or even a sample!)

Current and Future Work
• Looking at other collections
  • More TREC!
  • Reuters-21578
  • Collaborative filtering collections... such as?
• Looking at other techniques
  • Comparison to collaboration alone?
  • Other methods of combining content and collaboration