20 May 2010, LREC 2010
Building a Domain-Specific Document Collection for Evaluating Metadata Effects on Information Retrieval
Walid Magdy, Jinming Min, Johannes Leveling, Gareth Jones
School of Computing, Dublin City University, Ireland
Outline
- CNGL
- Objective
- Data collection preparation and overview
- IR test collection design
- Baseline Experiments
- Summary
CNGL
- Centre for Next Generation Localisation (CNGL)
- 4 universities: DCU, TCD, UCD, and UL
- Team: 120 PhD students, postdocs, and PIs
- Supported by Science Foundation Ireland (SFI)
- 9 industrial partners: IBM, Microsoft, Symantec, …
- Objective: automation of the localisation process
- Technologies: MT, AH, IR, NLP, Speech, and Dev.
Objective
- Create a collection of data that is:
  1. Suitable for IR tasks
  2. Suitable for other research fields (AH, NLP)
  3. Large enough to produce conclusive results
  4. Associated with defined evaluation strategies
- Prepare the collection from freely available data:
  - YouTube
  - Domain-specific (Basketball)
- Build a standard IR test collection (document set + topics set + relevance assessment)
YouTube Video Features
- Document: Video URL, Video Title
- Posting User
- Posting date
- Description
- Tags
- Category
- Comments
- Responded Videos
- Related Videos
- Length
- Number of Views
- Number of Ratings
- Number of Favorited
Methodology for Crawling Data
- 50 NBA-related queries used to search YouTube
- First 700 results per query crawled, together with related videos
- Crawled pages parsed and metadata extracted
- Extracted data represented in XML format
- Non-sport category results filtered out
- Queries used:
  - NBA, NBA Highlights, NBA All Stars, NBA fights
  - Top-ranked 15 NBA players in 2008, plus Jordan and Shaq
  - 29 NBA teams
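The filtering and XML steps above could look roughly like the following. This is a minimal sketch, not the authors' crawler: the record layout, the element names (`video`, `url`, `title`, `category`, `description`, `tags`, `tag`), and the category label `"Sports"` are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

def keep_sports(records, category="Sports"):
    """Drop parsed video records whose category is not the sport category.

    The label "Sports" is an assumed value, not taken from the slides.
    """
    return [r for r in records if r.get("category") == category]

def to_xml(record):
    """Serialise one parsed record as an XML string (hypothetical schema)."""
    video = ET.Element("video")
    for field in ("url", "title", "category", "description"):
        ET.SubElement(video, field).text = record.get(field, "")
    tags = ET.SubElement(video, "tags")
    for tag in record.get("tags", []):
        ET.SubElement(tags, "tag").text = tag
    return ET.tostring(video, encoding="unicode")

# Example with two made-up records; only the first passes the filter.
records = [
    {"url": "u1", "title": "Jordan dunks", "category": "Sports",
     "description": "Best dunks", "tags": ["nba", "jordan"]},
    {"url": "u2", "title": "Music clip", "category": "Music",
     "description": "", "tags": []},
]
xml_docs = [to_xml(r) for r in keep_sports(records)]
```

Keeping one XML document per video makes it straightforward to index individual metadata fields separately later on.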
Data Collection Overview
- Crawled video pages: 61,340
- Max crawled related/responded video pages: 20
- Max crawled comments for a given video page: 500
- Comments associated with the contributing user's ID
- Crawled user profiles: ≈ 250k
XML sample
Topics Creation
- 40 topics (queries) created
- Specific topics related to NBA
- TREC topic = query (title) + description + narrative

<title>Michael Jordan best dunks</title>
<description>Find the best dunks through the career of Michael Jordan in NBA. It can be a collection of dunks in matches, or dunk contest he participated in.</description>
<narrative>A relevant video should contain at least one dunk for Jordan. Videos of dunks for other players are not relevant. And other plays for Jordan other than dunks are not relevant as well</narrative>
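A topic in this three-field form can be read into a simple mapping. The sketch below assumes each topic is a well-formed XML fragment and wraps it in a hypothetical `<topic>` root element so that a standard parser accepts it; the function name `parse_topic` is likewise illustrative.

```python
import xml.etree.ElementTree as ET

def parse_topic(fragment):
    """Return {"title": ..., "description": ..., "narrative": ...}.

    The <topic> wrapper is an assumption added only so the fragment
    has a single root element.
    """
    root = ET.fromstring(f"<topic>{fragment}</topic>")
    # Collapse internal whitespace so multi-line fields become one line.
    return {child.tag: " ".join((child.text or "").split()) for child in root}

topic = parse_topic(
    "<title>Michael Jordan best dunks</title>"
    "<description>Find the best dunks through the career of "
    "Michael Jordan in NBA.</description>"
    "<narrative>A relevant video should contain at least one dunk "
    "for Jordan.</narrative>"
)
query = topic["title"]  # the title field serves as the search query
```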
Relevance Assessment
- 4 indexes created from the metadata fields: Title, Tags, Description, Related videos titles
- 5 different retrieval models used
- 4 indexes × 5 models = 20 different result lists, each containing 60 documents
- Result lists merged and presented to assessors in random order
- 122 to 466 documents assessed per topic
- 1 to 125 relevant documents per topic (avg. = 23)
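The merging step is a form of pooling: the top documents of every result list are combined, duplicates are dropped, and the pool is shuffled so that assessors cannot infer the original ranks. A minimal sketch under those assumptions follows; the function name, the seed, and the toy document IDs are illustrative and not taken from the slides.

```python
import random

def build_pool(result_lists, depth=60, seed=0):
    """Merge the top `depth` documents of each ranked list into one
    duplicate-free pool, returned in random order."""
    pool = set()
    for ranked in result_lists:
        pool.update(ranked[:depth])
    docs = sorted(pool)                # fixed starting order for reproducibility
    random.Random(seed).shuffle(docs)  # hide the original ranking from assessors
    return docs

# Toy example: 20 lists (4 indexes x 5 models) with heavy overlap.
toy_lists = [[f"doc{(i + j) % 200}" for j in range(100)] for i in range(20)]
pool = build_pool(toy_lists)
```

With 20 lists of 60 documents the pool can never exceed 1,200 documents, and overlap between lists shrinks it further, which is consistent with the 122 to 466 assessed documents per topic reported above.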
Baseline Experiments
- Search the 4 different indexes (built from Title, Tags, Description, Related videos titles)
- Indri retrieval model used to rank results
- 1000 results retrieved for each search
- Mean average precision (MAP) used to compare the results
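For reference, MAP averages, over all topics, the precision measured at the rank of each relevant document. The sketch below uses the standard definition with a cutoff of 1000 results per topic; the data structures (`runs` and `qrels` as plain dictionaries) are assumptions for illustration and not the evaluation tool used in the experiments.

```python
def average_precision(ranked, relevant, cutoff=1000):
    """Average precision of one ranked list against a set of relevant IDs."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranked[:cutoff], start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant document
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)

def mean_average_precision(runs, qrels, cutoff=1000):
    """Mean of per-topic average precision over all topics in `qrels`."""
    scores = [average_precision(runs.get(topic, []), rel, cutoff)
              for topic, rel in qrels.items()]
    return sum(scores) / len(scores) if scores else 0.0

# Toy example with two topics.
runs = {"t1": ["a", "b", "c", "d"], "t2": ["x", "y"]}
qrels = {"t1": {"a", "c"}, "t2": {"y"}}
map_score = mean_average_precision(runs, qrels)
```

In the toy example, topic `t1` scores (1/1 + 2/3) / 2 and topic `t2` scores 1/2, so the mean is 2/3.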
Results
Summary (new language resource)
- Collection: 61,340 XML docs, 250,000 user profiles, 40 topics + rel. assess.
- Data: Videos, Metadata, Tags, Comments, Ratings, # Views
- Potential uses: IR test set, Sentiment Analysis, NER, AH/Personalisation, Reranking using ML, Multimedia processing
Questions & Answers
Q: Is this collection available for free?
A: No
Q: Nothing could be provided?
A: Scripts + Topics + Rel. assess. (needs updating)
Q: Any other questions?
A: …
Thank you
YouTube Statistics (1/8): Min 13/09/2005, Max 03/03/2009
YouTube Statistics (2/8): Min 0, Max 84, Mean 12, Median 10, Std Dev 10
YouTube Statistics (3/8): Min 0, Max 21,710,757, Mean 35,707, Median 3,329, Std Dev 221,091
YouTube Statistics (4/8): Min 0, Max 23,147, Mean 58, Median 6, Std Dev 328
YouTube Statistics (5/8): Min 00:00, Max 02:38:20, Mean 00:02:53, Median 00:02:10, Std Dev 00:02:54
YouTube Statistics (6/8): Min 0, Max 27,029, Mean 52, Median 8, Std Dev 303
YouTube Statistics (7/8): Min 0, Max 5, Mean 4, Median 5, Std Dev 1
YouTube Statistics (8/8): Min 0, Max 72,230, Mean 94, Median 7, Std Dev 687
YouTube Statistics (9/9): Min 0, Max 232, Mean 0, Median 0, Std Dev 2