CARROT II: Collaborative Agent-based Routing and Retrieval of Text, Version 2
CADIP Fall Research Symposium, 2002
10/24/2002, R. Scott Cost, CADIP, UMBC
Mission
- Serve the current and future information needs of the community through the construction of a powerful yet flexible, high-bandwidth distributed IR system that can integrate information from a variety of sources
- Create a testbed for research in a variety of IR issues
- Foster new and ongoing IR research at UMBC, CADIP’s affiliates, and the sponsor organization
Reports
The presentation consists of three reports:
- Project Progress and Status
- TREC Participation
- Current Student Research
1: Project Status
- Overview
- Current Status
- Progress
- Current Issues
- Goals
- Contact Details
- Summary
Overview
During the past year, the C2 Project has made substantial progress towards its current goals, and has continued to expand and thrive, both in size and in the variety of relevant research directions.
Status
Currently, we have:
- A DIR system which is portable, scalable, and has the potential to support mixed collections of information sources.
- Nodes for classic IR, web search, and crawling.
- A Java-based IR engine (WONDIR).
- An integrated version of the Telltale IR engine.
Progress
Since Last Review
- Completion of Telltale integration
- Advances in WONDIR’s scalability
- First formal C2 presentation, in Madrid
- First TREC participation
Since Last Symposium
- Full, working C2 system
- WONDIR IR Engine
- S. Kallurkar’s Master’s Thesis
Current Issues
Some issues of significant concern are:
- Scalability: Telltale and WONDIR need to index more data, and in less time.
- Metadata: Needs to be extended to support the integration and fusion of results from different sources.
- Semantic Web: How can we use semantic markup in queries and handle it in text?
- Streams: The logical extension of large, extremely dynamic corpora.
3/6/12 (from 9/2001)
- 3: Exercise system and prepare initial results for publication.
- 6: Expand system. Heavy evaluation, and preparation for debut.
- 12: Extensions (routing algorithms, fusion, metadata combination…).
Goals (3/6/12)
3
- Presentations at TREC
- Submissions to SIGIR, AAMAS and WWW
6
- Resolution of scaling problems, indexing 2G/node easily
- Integration of semantic markup, ‘magnification’
12
- Successful second round of TREC
- Integration and fusion of multiple source types
- Support for data streams
Summary
The C2 project is making steady progress towards its goal of high-bandwidth IR from distributed, heterogeneous sources.
For More Information …
For more details on the goals and design of the project, see the documents on the Project site: http://www.csee.umbc.edu/~cost/carrot2/
C2 is powered by:
- Jackal: An Agent Communications Infrastructure.
- The WONDIR Engine.
- Telltale.
* The C2 project is supported in part by the U.S. Department of Defense.
2: TREC Participation
- Overview
- TREC’s Web Track
- Topic Distillation
- Approach
- Results
- Plans
- Summary
Overview
This year, C2 made its first successful entry in the TREC event.
TREC
- An annual event, organized by NIST, in which many IR groups gather to test their current systems’ ability to solve various IR problems.
- The TREC event is organized into tracks, each of which focuses on a particular type of problem or data.
TREC’s Web Track
- Focus is web data.
- Data set: a crawl of the .gov domain.
  - 18.1 gigabytes
  - 1.25 million documents
  - Crawled early 2002
- Two tasks:
  - Homepage Finding
  - Topic Distillation
Topic Distillation
- Given an information need (query), find the best ‘resource page’ for that need.
- This is not necessarily the page which best matches the contents of the query; value is given to links to other pages of value as well.
Approach
Given a collection of pages and a query:
- Compute query similarity to each page, using VSM and cosine similarity
- Consider the 1000 top-ranked documents
- Decorate the subcollection with similarities
- Employ a spreading activation function to propagate relevance
- Select the top-ranked documents in the resulting graph
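The steps above can be sketched roughly as follows. This is a minimal illustration, not the C2 implementation: the function names, whitespace tokenization, raw term-frequency weighting, the `decay` factor, the direction of propagation (a page gains credit from the pages it links to), and the single propagation round are all assumptions, since the slides do not give the actual weight equation.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def rank_with_links(pages, links, query, top_k=1000, decay=0.5, rounds=1):
    """pages: {url: text}; links: {url: [urls it links to]} (hypothetical inputs)."""
    q = Counter(query.lower().split())
    # Step 1: vector-space similarity of every page to the query.
    sims = {u: cosine(Counter(t.lower().split()), q) for u, t in pages.items()}
    # Steps 2-3: keep the top_k pages, each decorated with its similarity.
    top = sorted(sims, key=sims.get, reverse=True)[:top_k]
    score = {u: sims[u] for u in top}
    # Step 4: spread activation along links inside the subcollection.
    for _ in range(rounds):
        nxt = dict(score)
        for u in top:
            for v in links.get(u, []):
                if v in score:
                    nxt[u] += decay * score[v]  # assumed weight equation
        score = nxt
    # Step 5: rank by the propagated score.
    return sorted(score, key=score.get, reverse=True)
```

With `links` empty, the function reduces to a plain similarity ranking, which corresponds to a raw-similarity baseline.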
Results
We submitted 5 runs:
- 2 using raw similarity
  - Flood query to all nodes
  - Send query to N best nodes
- 3 integrating link topology information
  - Variations on the same weight equation
(The last three runs are based on the similarity computed in the first.)
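The difference between the two raw-similarity runs can be illustrated with a small routing sketch. The slides do not say how C2 chooses the N best nodes; the per-node term-weight metadata, the additive score, and the name `select_nodes` are assumptions made only for this example.

```python
def select_nodes(node_metadata, query_terms, n=None):
    """Pick the nodes that should receive a query.

    node_metadata: {node_id: {term: weight}}, a hypothetical collection summary.
    n=None floods the query to every node; otherwise the n best-scoring
    nodes are returned.
    """
    if n is None:
        return sorted(node_metadata)  # flood: all nodes

    def score(node):
        meta = node_metadata[node]
        return sum(meta.get(t, 0.0) for t in query_terms)

    # Highest score first; ties broken by node id for determinism.
    ranked = sorted(node_metadata, key=lambda nd: (-score(nd), nd))
    return ranked[:n]
```

Flooding maximizes recall at the cost of querying every node, while routing to the N best nodes trades some recall for lower load; the quality of that trade depends on how accurate the node metadata is.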
[Figure: TREC Baseline Run]
[Figure: Baseline Diff. from Median]
[Figure: TREC TD Run]
Plans for the Future
In preparation for next year’s competition:
- Improve scale
- Investigate work in propagating information (this was a new area for us)
- Employ ideas from ongoing work in scent and credibility.
Summary
For a first-time entry, C2 did reasonably well:
- Performance similar to median for baseline
- Performance below median with topology information
3: Student Research
- Overview
- Highlights
- Ongoing Research
- Spotlight on:
  - Data Fusion
  - Document Summarization
  - Query Caching
- Open Questions
- Summary
Overview
The C2 Project is a multi-faceted effort which encompasses a broad range of research questions. Many of these questions are currently being investigated by UMBC students, both within the context of the project’s goals, and as part of their own academic research.
Highlights
- Srikanth Kallurkar
- Yongmei Shi
- Hemali Majithia
- Christopher James
- Akshay Java
- Sachin Bhatkar
- Dayn Harum
- Sowjanya Rajavaram
- Matt Siegel
- Drew Ogle
Highlights: S. Kallurkar
- Ph.D. Student
- Topic: Results Fusion (Masters Topic: Clustering)
- C2 Technical Lead
- Wrote the first C2 Masters Thesis, on online clustering in a DIR system.
Highlights: Y. Shi
- Ph.D. Student
- Research: Document Summarization for Metadata expert in residence
- Developer: C2 Web Search Agent
- Implemented first infrastructure prototype
Highlights: H. Majithia
- M.S. Student
- Topic: Query Caching in DIR
- Collection Librarian, TREC Liaison
- Testing and Evaluation
- Developer: Query/Client agents
Highlights: C. James
- M.S. Student
- Topic: Inferring Document Credibility
- Java Performance Task Force
- Developer: GUI Query Interfaces
Highlights: A(kshay). Java
- M.S. Student
- Topic: Information Scent for Web Search
- Recently completed an internship at PARC
- Heading C2 task force on Java performance
- Developer: C2 Web Crawler agent
Highlights: S. Bhatkar
- M.S. Student
- Topic: Query Expansion/Enhancement
- Java Performance Task Force
Highlights: D. Harum
- M.S. Student
- Topic: Java Real Time Performance Monitoring (applied to WONDIR)
- Integrated monitoring code into SIRE file system, evaluated caching strategies.
Highlights: S. Rajavarum
- M.S. Student
- Topic: Protocols for Interaction in a Multi-Agent System
- Java Performance Task Force
- Newest member of the C2 team
Highlights: M. Siegel
- M.S. Student
- Employed by the Sponsor
- Worked on C2/Telltale integration
- Developer: Distributed file system layer
Highlights: T. Laufert
- M.S. Student
- Employed by the Sponsor
- Developer: Document flow visualization tools for C2
Highlights: D. Ogle
- Undergraduate Student
- Resident Telltale Engineer
- Integrated Telltale into the C2 system.
- Also provides Telltale support for ID group.
Spotlight: Data Fusion
- Results fusion is an essential component in the success of a distributed IR system.
- It is especially difficult when information sources in the system vary widely in content and form.
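To make the fusion problem concrete, here is a minimal sketch of one standard baseline: min-max score normalization followed by summation (often called CombSUM). It is not presented as the method studied in C2; the function name `fuse` and the input shape are assumptions for illustration.

```python
def fuse(result_lists):
    """Merge ranked results from several sources into one list.

    result_lists: list of {doc_id: raw_score}, one dict per source.
    Raw scores from different engines are not comparable, so each list is
    min-max normalized to [0, 1] before the scores are summed.
    """
    fused = {}
    for results in result_lists:
        if not results:
            continue
        lo, hi = min(results.values()), max(results.values())
        span = hi - lo
        for doc, s in results.items():
            # A list whose scores are all equal contributes 1.0 per document.
            norm = (s - lo) / span if span else 1.0
            fused[doc] = fused.get(doc, 0.0) + norm
    # Highest fused score first; ties broken by document id.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))
```

Normalization is the step that becomes hard when sources vary widely in content and form, since score distributions from a web search node and a classic IR node may have little in common.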
Spotlight: Document Summarization
- Successful collection selection and comparison depend on accurate metadata.
- Document summarization may lead us to more compact and richer metadata descriptions of collections.
Spotlight: Query Caching
By caching query results and returning approximate answers, we hope to reduce the overhead of repeatedly processing similar queries in a distributed environment.
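One way to picture approximate answers from a cache is sketched below. It assumes that two queries are "similar" when the Jaccard overlap of their term sets meets a threshold; the class name `ApproxQueryCache`, the threshold value, and the linear scan are illustrative assumptions, not the design investigated in the project.

```python
class ApproxQueryCache:
    """Cache that answers a query from the most similar cached query."""

    def __init__(self, threshold=0.6):
        self.threshold = threshold  # minimum Jaccard similarity for a hit
        self.entries = []           # list of (frozenset of terms, results)

    @staticmethod
    def _terms(query):
        return frozenset(query.lower().split())

    def put(self, query, results):
        self.entries.append((self._terms(query), results))

    def get(self, query):
        """Return cached results for the closest query, or None on a miss."""
        q = self._terms(query)
        best, best_sim = None, 0.0
        for terms, results in self.entries:
            union = q | terms
            sim = len(q & terms) / len(union) if union else 0.0
            if sim > best_sim:
                best, best_sim = results, sim
        return best if best_sim >= self.threshold else None
```

A hit avoids sending the query to remote nodes at all, at the price of returning results that were computed for a slightly different query.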
Open Issues
- Semantic Web: Much remains to be done in integrating the Semantic Web into C2.
  - Indexing and enhancement of marked-up data
  - Use of markup in routing and fusion
- Presentation of mixed-type results
- Data streams
Summary
- In the past 2+ years, the C2 project has generated and sustained significant interest and research in both practical and theoretical aspects of Distributed Information Retrieval.
- By the end of the Fall semester, C2 will have earned 3 Masters degrees, and will have contributed to several others.
Bibliography
- Cost et al., CARROT II: Collaborative Agent-based Routing and Retrieval of Text, Proceedings of the Fall 2001 CADIP Research Symposium.
- Cost et al., Integrating Distributed Information Sources with CARROT II, Proceedings of the Workshop on Cooperative Information Agents (CIA), 2002.
- Kallurkar, Document Migration in Distributed Information Retrieval, Masters Thesis for UMBC CSEE, 2002.
In Preparation:
- Cost et al., ---, Proceedings of the Fall 2002 CADIP Research Symposium.
- Cost, WONDIR.
- Harum, ---, Masters Project for UMBC CSEE.
- Java et al., Integrating Web Sources with Distributed IR.
- Kallurkar et al., Comparison of Results Fusion Methods.
- Majithia, Investigation of Caching Mechanisms in Multi-Agent Based Architecture for Distributed Information Retrieval Systems, Masters Thesis for UMBC CSEE.
Bibliography…
Also of note:
- T. Oates, V. Bhat, V. Shanbhag, Using Latent Semantic Analysis to Find Different Names for the Same Entity in Free Text, Proceedings of WIDM, CIKM ’02.
- U. Shah, Information Retrieval on the Semantic Web, Masters Thesis, UMBC CSEE, Spring 2002.
- U. Shah, T. Finin, A. Joshi, R. S. Cost, J. Mayfield, Information Retrieval on the Semantic Web, Proceedings CIKM ’02.