Nutch Tutorial IST 516 Fall 2010 Dongwon Lee

Nutch Tutorial IST 516 Fall 2010 Dongwon Lee, Ph. D. Wonhong Nam, Ph. D.

What is Nutch?
  • Apache has open-source solutions for two components of search engines
      • Crawler: Nutch
      • Indexer: Lucene and Solr (merged as Lucene/Solr in 2010)
  • A project headed by Doug Cutting
  • Goal: an open-source search engine scalable enough to index the entire web (~ billions of pages)
  • Nutch includes
      • Java crawler
      • HTML parser
      • Lucene search/index library
      • lots more

Features of Nutch
  • Robot crawler, can use proxy
  • Includes hosts via grep; excludes by host names and suffixes
  • Continuous indexing
  • FTP indexing login option
  • Index logging options
  • Flexible query parsing
  • Includes link-analysis module (mainly for multisite search)
  • Includes approximately fifteen relevance quality adjustment options
  • Caches original page for display

Workflow of Nutch
  • There are two paths through a search engine: the index path and the query path
  • The index path shows how the index gets filled with documents
      • The documents are fed to an analyzer, which transforms them into appropriately weighted terms (or scores) and passes them to the IndexWriter
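
The analyzer step on the index path can be illustrated with a minimal sketch. It assumes a plain relative term-frequency weighting, which is a stand-in chosen for illustration and not the scoring Lucene actually applies; the function name analyze is hypothetical.

```python
import re
from collections import Counter


def analyze(text):
    """Split text into lowercase terms and weight each by relative frequency.

    Hypothetical weighting for illustration only; Lucene's real analyzers
    and scoring are considerably more elaborate.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    # Each term maps to the fraction of the document it accounts for.
    return {term: count / total for term, count in counts.items()}


if __name__ == "__main__":
    weights = analyze("Nutch crawls the web; Lucene indexes the web")
    for term, weight in sorted(weights.items()):
        print(term, weight)
```

In a real pipeline the resulting term/weight pairs are what would be handed on to the index writer.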

Connection Steps
  • For security reasons, the ist516 server is only accessible from IST’s VLabs
      • First, log in to IST’s VLabs environment
      • Second, from VLabs, log in to the ist516 server

Connecting to VLabs
  • From Windows/Mac remote desktop, log in to VLabs using your PSU ID/PWD
  • Note “UPPSU-ID” for the user name below

Connecting to ist516.ist.psu.edu
  • A UNIX server is prepared for proj #2
      • ist516.ist.psu.edu (130.203.136.10)
      • Can be accessed via the SSH protocol only
  • If not pre-installed, get an SSH client from https://downloads.its.psu.edu/ ("File Transfer")

Connecting to ist516.ist.psu.edu
  • If an SSH client is pre-installed in VLabs, use it
  • In “Quick connect”, use the provided team ID/PWD

ist516.ist.psu.edu
  • Tomcat (Apache’s servlet container/web server) and Nutch are already installed on the server
      • Under each team's home directory (e.g., /home/team-ID/nutch-1.0)
      • Modify the files under "nutch-1.0/conf" to change the behavior of Nutch as you wish

Running Tomcat and Nutch
  • To start or stop the Tomcat server, type:
          > start-tomcat
          > stop-tomcat
  • To run Nutch, at the command line, just type:
          > nutch
    or provide various parameters:
          > nutch [parameters]
  • The server has most of the typical UNIX software installed, including:
      • wget: downloads files from a URL
      • nano: a small editor that Windows users may find useful/familiar
      • Emacs: a full-fledged, powerful UNIX editor

Crawling in Nutch
  • There are two approaches to crawling:
      • Intranet crawling, with the crawl command
      • Whole-web crawling, with much greater control, using the lower-level inject, generate, fetch and updatedb commands
  • Intranet crawling is more suitable for small-scale projects

1. Intranet Crawling
  • Create a text file, say urlfile.txt, containing some seed URLs, e.g., http://pike.psu.edu/
  • Edit the file conf/crawl-urlfilter.txt and replace MY.DOMAIN.NAME with the name of the domain you wish to crawl
      • E.g., to limit the crawl to the pike.psu.edu domain, the line should read:
          +^http://([a-z0-9]*\.)*pike.psu.edu/
      • This will include any URL in the domain pike.psu.edu
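
To see which URLs such a filter line accepts, the expression can be tried outside Nutch. The sketch below assumes the intended line is +^http://([a-z0-9]*\.)*pike.psu.edu/ (the leading "+" means "include" and is not part of the expression) and that the filter looks for a match anywhere from the start anchor, which is mimicked here with re.search. The helper name is_included and the test URLs other than http://pike.psu.edu/ are made up for illustration.

```python
import re

# The expression from the crawl-urlfilter.txt line, without the leading "+".
# The dots in "pike.psu.edu" are left unescaped, as in the filter line,
# so each of them technically matches any single character.
PATTERN = re.compile(r"^http://([a-z0-9]*\.)*pike.psu.edu/")


def is_included(url):
    """Return True if the URL would pass the include rule (illustrative helper)."""
    return PATTERN.search(url) is not None


if __name__ == "__main__":
    for url in [
        "http://pike.psu.edu/",
        "http://www.pike.psu.edu/classes/",
        "http://www.psu.edu/",
        "https://pike.psu.edu/",
    ]:
        print(("+" if is_included(url) else "-"), url)
```

Note that the rule only covers plain http:// URLs; an https:// address on the same host does not match it.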

1. Intranet Crawling
  • Edit the file conf/nutch-site.xml accordingly
      • At a minimum, insert the following property and fill in a proper value:
          <property>
            <name>http.agent.name</name>
            <value>YOUR-CRAWLER-NAME-HERE</value>
            <description></description>
          </property>

1. Intranet Crawling
  • Use the crawl command for crawling. Its options include:
      • -dir: names the directory to put the crawl in
      • -depth: indicates the link depth from the root page that should be crawled
      • -delay: determines the number of seconds between accesses to each host
      • -threads: determines the number of threads that will fetch in parallel
  • E.g., a typical call might be:
          > nutch crawl urlfile.txt -dir crawl.test -depth 3 >& log
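
The effect of the -depth option can be pictured with a small simulation over an in-memory link graph. This is only a sketch of one plausible reading, assuming each depth level corresponds to one round of fetching the pages discovered in the previous round; it is not Nutch's implementation, and the function crawl and the LINKS graph are hypothetical.

```python
def crawl(seeds, links, depth):
    """Return the pages fetched in `depth` rounds, in fetch order.

    `links` maps a page to the pages it links to. Round 1 fetches the
    seeds; each later round fetches pages first discovered in the
    previous round.
    """
    fetched = []
    seen = set(seeds)
    frontier = list(seeds)
    for _ in range(depth):
        next_frontier = []
        for page in frontier:
            fetched.append(page)  # "download" the page
            for target in links.get(page, []):
                if target not in seen:  # schedule each page only once
                    seen.add(target)
                    next_frontier.append(target)
        frontier = next_frontier
    return fetched


# Hypothetical site structure used only for this illustration.
LINKS = {
    "/": ["/a", "/b"],
    "/a": ["/c"],
    "/b": ["/c"],
    "/c": ["/d"],
}

if __name__ == "__main__":
    print(crawl(["/"], LINKS, 1))
    print(crawl(["/"], LINKS, 3))
```

Under this reading, a depth of 3 reaches pages up to two links away from the seed, while /d is discovered but never fetched.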

1. Intranet Crawling
  • The indexer uses the downloaded contents to generate an inverted index of all terms and all pages
  • The document set is divided into a set of index segments, each of which is fed to a single searcher process
  • Each searcher also draws upon the Web content from earlier, so it can provide a cached copy of any Web page
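
An inverted index maps each term to the set of pages that contain it, so a query becomes a set lookup instead of a scan over every page. The following is a minimal sketch of that idea, not Lucene's data structure; the function names and the sample pages (including their URLs and text) are invented for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(pages):
    """Map every term to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in tokenize(text):
            index[term].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results


# Hypothetical downloaded pages.
PAGES = {
    "http://pike.psu.edu/a": "Nutch is a web crawler",
    "http://pike.psu.edu/b": "Lucene is a search library",
    "http://pike.psu.edu/c": "Nutch uses Lucene for search",
}

if __name__ == "__main__":
    index = build_inverted_index(PAGES)
    print(sorted(search(index, "nutch lucene")))
    print(sorted(search(index, "search")))
```

Splitting PAGES into several smaller dictionaries and building one index per part would correspond loosely to the index segments served by separate searcher processes.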

2. Internet Crawling
  • More steps are needed than for intranet crawling
  • Explore it for your proj #2
  • Refer to:
      • http://wiki.apache.org/nutch/NutchTutorial

3. Searching
  • Tomcat is installed, and each group has its own webapp directory, which holds the Nutch war file
  • To search, put the Nutch war file into your servlet container:
          > cp ~/nutch-1.0/nutch*.war ~/tomcat/webapps/ROOT.war
  • Go to the directory that your crawler created and run the Tomcat server:
          > cd crawl.test
          > start-tomcat

3. Searching
  • Connect your browser to:
      • http://ist516.ist.psu.edu:900?/ where ? is your group number
      • E.g., Team 1: http://ist516.ist.psu.edu:9001/
  • To access this URL, students need to log in to VLabs first and access it from there:
      • vlabs.up.ist.psu.edu + PSU ID/PWD
      • Refer to the VLabs Tutorial for more details:
          • http://pike.psu.edu/classes/ist516/2010fall/s/slides/vlabs-tutorial.ppt

Editing Nutch Look
  • To change the look & feel of the search interface:
      • search.html is automatically generated, so do not edit it
      • Instead, change the XML files directly:
          • ~/nutch-1.0/src/web/pages/en/search.xml
          • ~/nutch-1.0/src/web/pages/en/about.xml
          • ~/nutch-1.0/src/web/pages/en/help.xml
  • For more details on how to edit the Nutch look, see:
      • http://www.stevekallestad.com/wiki/Editing_nutch

Reference
  • Apache’s Official Nutch Tutorial
      • http://wiki.apache.org/nutch/NutchTutorial
  • Peter Wang’s Nutch Tutorial
      • http://zillionics.com/resources/articles/NutchGuideForDummies.htm
  • IST 441’s Nutch Tutorial
      • http://clgiles.ist.psu.edu/IST441/materials/nutch-lucene/nutch-crawling-and-searching.pdf