Searching CiteSeer Metadata Using Nutch
Larry Reeve
INFO 624 – Information Retrieval
Dr. Lin – Winter 2005

CiteSeer

CiteSeer
• Search issues
  • Keyword-based full-text search
  • Boolean search syntax
• How to…
  • search by author name?
  • search by author affiliation?
  • search by publication date?

CiteSeer
• Example: suggested author search approach
  • For authors, list all variants that appear in citations, separated by "OR"
  • Examples:
    • m jordan or michael jordan or m i jordan or michael i jordan
    • howard w/2 white or h w/2 white
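
The variant-listing approach above can be sketched in a few lines. This is only an illustration of how such a query string might be assembled; the function name `author_query` and its parameters are hypothetical and are not part of CiteSeer or of the project.

```python
def author_query(first, last, middle=None):
    """Build an OR-separated list of author-name variants (hypothetical helper)."""
    first, last = first.lower(), last.lower()
    # Variants without a middle initial: "m jordan", "michael jordan"
    variants = [f"{first[0]} {last}", f"{first} {last}"]
    if middle:
        initial = middle.lower()[0]
        # Variants with a middle initial: "m i jordan", "michael i jordan"
        variants += [f"{first[0]} {initial} {last}", f"{first} {initial} {last}"]
    return " or ".join(variants)


print(author_query("Michael", "Jordan", "I"))
# prints: m jordan or michael jordan or m i jordan or michael i jordan
```

Even with a helper like this, the query still matches any document containing those words, which is the precision problem the metadata fields are meant to address.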

CiteSeer – phrase search

CiteSeer – term search

Goal
• Search selected metadata fields
  • Author name
  • Author affiliation
  • Publication date (month, day, year)
  • Title
  • Others…
• Increase precision

Methodology - Nutch
• An open-source web search engine
• Includes crawling, indexing, searching
• Technologies: Java, JSP, Tomcat
• Extensible
  • new fields
  • new parsing/indexing facilities
  • adapt UI for searching

Methodology - Metadata

Methodology
1) Split XML file into HTML documents
   • Each HTML doc contains metadata
   • Allows existing crawler to be used/extended
2) Crawl and index HTML documents on local filesystem
3) Search generated index using JSP page
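
Step 1 could look roughly like the sketch below, which turns each record of an XML file into a small HTML page whose `<meta>` tags carry the record's fields. The element names (`records`, `record`, `title`, `author`, `pubyear`), the file-naming scheme, and the function names are assumptions made for illustration; the actual CiteSeer metadata schema and the project's split program are not shown in the slides.

```python
import html
import xml.etree.ElementTree as ET


def record_to_html(record):
    """Render one <record> element as an HTML page with one <meta> tag per field."""
    title = record.findtext("title", default="")
    metas = []
    for child in record:
        if child.tag == "title" or child.text is None:
            continue
        metas.append('<meta name="{}" content="{}">'.format(
            html.escape(child.tag, quote=True),
            html.escape(child.text.strip(), quote=True)))
    return "<html><head><title>{}</title>\n{}\n</head><body></body></html>".format(
        html.escape(title), "\n".join(metas))


def split_records(xml_text):
    """Map a hypothetical file name to the HTML page for each record."""
    root = ET.fromstring(xml_text)
    return {"doc{:05d}.html".format(i): record_to_html(rec)
            for i, rec in enumerate(root.iter("record"), start=1)}


# Made-up sample input, not taken from the CiteSeer download.
SAMPLE = """<records>
  <record><title>Paper A</title><author>Peter Lee</author><pubyear>1998</pubyear></record>
  <record><title>Paper B</title><author>John Smith</author></record>
</records>"""

pages = split_records(SAMPLE)
print(sorted(pages))  # prints: ['doc00001.html', 'doc00002.html']
```

Writing each page to a directory on the local filesystem would then give the crawler in step 2 something to walk.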

Methodology (pipeline diagram)
• Flow: XML File (100 records) -> Split Program -> 100 HTML Documents -> Nutch Crawler -> Nutch Search (JSP)
• Filters: Parse Filter, Index Filter, Query Filter
• Legend: Implemented as part of project

XML to HTML Split

Methodology - Split

Methodology – Crawl/Index
• Requires 2 filters to process metadata
• CSParseFilter
  • Parses HTML for metadata values
  • Implements Nutch HtmlParseFilter interface
• CSIndexingFilter
  • Uses metadata generated by ParseFilter
  • Adds metadata to index
  • Implements Nutch IndexingFilter interface
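
The division of labour between the two filters can be mimicked outside Nutch. The sketch below is a Python stand-in, not the Nutch Java interfaces: one function pulls `<meta>` name/content pairs out of a page (the parse step), and another turns them into prefixed index fields (the indexing step). The meta names, the lowercasing of values, and all function and class names are assumptions for illustration.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect name/content pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        name, content = attr_map.get("name"), attr_map.get("content")
        if name and content is not None:
            self.metadata[name.lower()] = content


def extract_metadata(html_text):
    """Parse step: return the metadata found in one HTML page."""
    collector = MetaCollector()
    collector.feed(html_text)
    collector.close()
    return collector.metadata


def to_index_fields(metadata, prefix="cs_"):
    """Indexing step: prefix each field name and lowercase its value."""
    return {prefix + name: value.lower() for name, value in metadata.items()}


page = ('<html><head><meta name="authorlast" content="Lee">'
        '<meta name="pubyear" content="1998"></head></html>')
print(to_index_fields(extract_metadata(page)))
# prints: {'cs_authorlast': 'lee', 'cs_pubyear': '1998'}
```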

Parse Filter – extract metadata

Index Filter

Methodology – Query
• Modification of Nutch search page
  • Change URL from filesystem metadata HTML to CiteSeer
  • Change to 20 hits, to match CiteSeer
• Query filter
  • Handles custom fields from index filter
  • Prefixed with cs_
  • Implements Nutch QueryFilter interface
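
To make the role of the cs_ prefix concrete, here is a small Python stand-in for fielded querying over an in-memory index. It is not the Nutch QueryFilter interface; the function names, the "content" default field, the requirement that every clause match, and the sample index entries (apart from the field names and the Peter Lee example from the slides) are assumptions.

```python
def parse_fielded_query(query, prefix="cs_"):
    """Split 'cs_field:term' tokens into (field, term) clauses.

    Tokens without a recognised prefix fall back to a default 'content' field.
    """
    clauses = []
    for token in query.lower().split():
        field, sep, term = token.partition(":")
        if sep and term and field.startswith(prefix):
            clauses.append((field, term))
        else:
            clauses.append(("content", token))
    return clauses


def search(index, query):
    """Return ids of documents whose fields satisfy every clause."""
    clauses = parse_fielded_query(query)
    return [doc_id for doc_id, fields in index.items()
            if all(term in fields.get(field, "").split() for field, term in clauses)]


# Made-up index: three authors named Lee, one of them Peter.
INDEX = {
    "doc1": {"cs_authorfirst": "peter", "cs_authorlast": "lee"},
    "doc2": {"cs_authorfirst": "alice", "cs_authorlast": "lee"},
    "doc3": {"cs_authorfirst": "david", "cs_authorlast": "lee"},
    "doc4": {"cs_authorfirst": "jeffrey", "cs_authorlast": "ullman"},
}

print(search(INDEX, "cs_authorlast:lee"))
# prints: ['doc1', 'doc2', 'doc3']
print(search(INDEX, "cs_authorlast:lee cs_authorfirst:peter"))
# prints: ['doc1']
```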

Query Filter

Evaluation
• Testing for precision/recall
  • 100 documents
• Stress test
  • 10,000 documents
    • Approx 10 mins to crawl/index
  • 575,000 documents in CiteSeer metadata download
    • (716,797 documents in CiteSeer)
    • 3.5 hours to split XML into HTML
    • 12 hours to crawl/index
    • ~551,000 indexed during crawling

Evaluation
• Precision & recall
  • Use first 100 docs (easy to measure recall)
  • Issue queries
    • Author last name
    • Author first & last name
    • Author affiliation
• Precision
  • Use max docs in each system
  • Issue author search queries to both systems
  • Measure precision on each page of 20 hits
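
Measuring precision on each page of 20 hits amounts to chunking a ranked list of relevance judgments and taking the fraction of relevant hits per chunk. The sketch below assumes the judgments are available as a list of booleans in rank order; the function name and the sample judgments are made up and are not results from the project.

```python
def precision_per_page(relevance, page_size=20):
    """Return the precision of each result page, given judgments in rank order."""
    pages = [relevance[i:i + page_size] for i in range(0, len(relevance), page_size)]
    return [sum(page) / len(page) for page in pages]


# Made-up judgments: page 1 has 15 relevant of 20, page 2 has 5 relevant of 10.
judgments = [True] * 15 + [False] * 5 + [True] * 5 + [False] * 5
print(precision_per_page(judgments))
# prints: [0.75, 0.5]
```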

Evaluation – P & R
• Look for all papers where Peter Lee is an author (1 document)
• cs_authorlast:lee
  • Returns 3 documents, all with last name of Lee
  • P=.33, R=1
• cs_authorlast:lee cs_authorfirst:peter
  • Returns single document
  • P=1, R=1
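
The figures above follow from the usual definitions: precision is the share of retrieved documents that are relevant, and recall is the share of relevant documents that were retrieved. A minimal sketch, with hypothetical document ids standing in for the three Lee papers:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for sets of retrieved and relevant document ids."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


relevant = {"doc1"}  # the single paper by Peter Lee

# Last name only: three documents returned, one of them relevant.
p, r = precision_recall({"doc1", "doc2", "doc3"}, relevant)
print(round(p, 2), r)  # prints: 0.33 1.0

# First and last name: only the relevant document is returned.
p, r = precision_recall({"doc1"}, relevant)
print(p, r)  # prints: 1.0 1.0
```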

Evaluation - Precision
• Author search:
  • Q1: Peter Lee
    • Project: cs_authorfirst:peter cs_authorlast:lee
    • CiteSeer: peter w/2 lee
  • Q2: Jeffrey Ullman
    • Project: cs_authorfirst:jeffrey cs_authorlast:ullman
    • CiteSeer: jeffrey w/2 ullman
  • Q3: John Smith
    • Project: cs_authorfirst:john cs_authorlast:smith
    • CiteSeer: john w/2 smith

Evaluation - Precision

Search Demo
• Available fields:
  • cs_authorfirst
  • cs_authorlast
  • cs_authoraffiliation
  • cs_pubyear
  • cs_pubmonth