Automatic Web Tagging and Person Tagging Using Language
Automatic Web Tagging and Person Tagging Using Language Models - Qiaozhu Mei†, Yi Zhang‡ Presented by Jessica Gronski‡ † University of Illinois at Urbana-Champaign ‡ University of California at Santa Cruz
Tagging a Web Document • The dual problem of search/retrieval: [Mei et al. 2007] – Retrieval: short description (query) relevant documents – Tagging: document short description (tag) • To summarize the content of documents • To access the document in the future retrieval Text Document Query/Tag tagging 2
Social Bookmarking of Web Documents Web documents Social bookmarks (tags) 3
Existing Work on Social Bookmarking • Social Bookmarking Systems – Del. icio. us, Digg, Citeulike, etc. • Enhance Social bookmarking systems – Anti-spam [Koutrika et al 2007] – Search& ranking tags [Hotho et al 2006] • Utilize social bookmarks – Visualization [Dubinko et al. 2006] – Summarization [Boydell et al. 2007] – Use tags to help web search: [Heymann et al. 2008]; [Zhou et al. 2008] 4
Research Questions • Can we automatically generate tags for web documents? – Meaningful, compact, relevant • Can we generate tags for other web objects, such as web users? 5
Applications of Automatic Tagging • Summarizing documents/ web objects • Suggest social bookmarks • Refine queries for web search – Finding good queries to a document • Suggest good keywords for online advertising 6
Rest of the Talk • A probabilistic approach to tag generation – Candidate Tag Selection – Web document representation – Tag ranking • Experiments – Web documents tagging; – web user tagging • Summary 7
Our Method User-Generated Corpus (e. g. , Del. icio. us, Wikipedia) ipod nano, data mining, presidential campaign index structure, statistics tutorial, computer science… Candidate tag pool representation data statistics tutorial analysis software model frequent probabilistic algorithm … Web Documents 0. 1599 0. 0752 0. 0660 0. 0372 0. 0311 0. 0310 0. 0233 0. 0188 0. 0173 Multinomial word Distribution Ranking candidate tags data mining 0. 26 statistics tutorial 0. 19 computer science 0. 17 index structure 0. 01 …… ipod nano 0. 00001 presidential campaign 0. 0 …… 8
Candidate Tag Selection • Meaningful, compact, user-oriented • From social bookmarking data – E. g. , Del. icio. us – Single tags that other people used – “phrases” statistically significant bigrams • From other user-generated web contents – E. g. , Wikipedia – Titles of entries in wikipedia 9
Representation of Web Documents • Multinomial distribution of words (unigram language models) – Commonly used in retrieval and text mining • Can be estimated from the content of the document, or from social bookmarks (our approach) – What other people used to tag that document text mining data probabilistic independence model … 0. 16 0. 08 0. 07 0. 04 0. 03 Baseline: Use the top words in that distribution to tag a document 10
Tag Ranking: A Probabilistic Approach • Web documents d a language model • A candidate tag t a language model from its co-occurring tags • Score and rank t by KL-divergence of these two language models Social Bookmark Collection 11
Rewriting and Efficient Computation Bias of using C to represent document d (e. g. , del. icio. us) Bias of using C to represent candidate tag t 1. Can be pre-computed from corpus; 2. Only store those PMI(w, t|C) > 0 12
Tagging Web Users • Summarize the interests and bias of a user • Web user a pseudo document • Estimate a language model from all tags that he used • The rest is similar to web document tagging 13
Experiments • Dataset: – Two-week tagging records from Del. icio. us Time Span Bookmarks Distinct Tags Distinct Users 02/13/07 ~ 02/26/07 579, 652 20, 138 111, 381 – Candidate tags: • Top 15, 000 Significant 2 -grams from del. icio. us; • titles of all wikipedia entries (5, 836, 166 entries, around 48, 000 appeared in del. icio. us) 14
Tagging Web Documents Urls LM p(w|d) Tag = Word Tag. Meaningful, = bigram Relevant, precise relevant http: //kuler. adobe. com/ (158 bookmarks) Too general, sometimes not relevant http: //www. youtube. com/ watch? v=6 gm. P 4 nk 0 EOE (157 bookmarks) color design colour webdesign palette colorscheme tools data, not adobe overfitcolours phrases graphics real picker cor flash web 2. 0 video But sometimes youtube not meaningful web internet xml community youtube revver vodcast primer comunidad participation ethnograpy Meaningful, relevant Tag = wikipedia title But partially adobe color covers good color design colour palettetags colors web color colour design colours Meaningful, inspiration palette cor relevant, webdesign color rgb real xml youtube web 2. 0 youtube video web 2. 0 xml online presentation social video youtube video internet video youtube revver research video vodcast primer p 2 p TV overfit data, not real phrases 15
Tagging Web Documents (Cont. ) Urls http: //pipes. yahoo. com (386 bookmarks) http: //www. miniajax. com/ (349 bookmarks) Too general, sometimes not relevant LM p(w|d) Tag = Word Relevant, precise yahoo pipes rss feeds web 2. 0 yahoo mashup feeds rss programming syndication Meaningful, pipes mashups relevant ajax javascript web 2. 0 webdesign programming code webdev ajax dhtml javascript moo. fx dragdrop phototype autosuggest But sometimes not meaningful Tag = bigram Tag = wikipedia title feeds mashup pipes web 2. 0 yahoo rss web 2. 0 mashup rss api feeds pipes programming ajax code javascript ajax javascript web 2. 0 css ajax javascript programming overfit data, not real phrases pipes yahoo mashups rss syndication mashups Meaningful, blog feeds real relevant, ajax dhtml javascript moo. fx javascript library javascript framework 16
Tagging Web Users LM p(w|d) photography art portraits tools web data, not overfit design real phrases geek User 1 humor programming photography Partially covers blog the interest webdesign security funny User 2 Tag = bigram Tag = wikipedia title art photography portraits digital flickr photoblog photography art photo flickr photography weblog wordpress art photography photoblog portraits photography landscapes flickr art contest geek hack humor programming hacking network programming tweak Meaningful, hacking relevant, real security geek humor sysadmin digitalcamera networking programming geek html geek hacking reference security 17
Tagging Web Users (Cont. ) Users LM p(w|d) Tag = bigram Tag = wikipedia title User 3 games arg tools programming sudoku cryptography software arg games puzzles games internet arg code games sudoku code generator community games arg games research games puzzles storytelling code generator community games User 4 web reference css development rubyonrails tools design rubyonrails web css development brower development editor development forum development firefox javascript tools javascript css webdev xhtml Missed many dhtml good tags css 3 dom 18
Discussions • Using top tags: too general, sometimes not relevant • Ranking tags by labeling language models: – Candidate = Social bookmarking words • Pros: relevant, compact • Cons: ambiguous, not so meaningful – Candidate = Social bookmarking bigrams • Pros: more meaningful, relevant • Cons: overfiting the data, sometimes not real phrases – Candidate = Wikipedia Titles: • Pros: meaningful, relevant real phrases • Cons: biased, missed potential good tags. (Bias(t, C)) 19
Summary • Automatic tagging of web documents and web users • A probabilistic approach based on labeling language models • Effective when the candidate tags are of high quality • Future work: – A robust way of generating candidate tags – Large scale evaluation 20
Thanks! 21
- Slides: 21