
Nutch Tutorial
IST 516, Fall 2010
Dongwon Lee, Ph.D.
Wonhong Nam, Ph.D.

What is Nutch?
• Apache has open-source solutions for two components of search engines
  • Crawler: Nutch
  • Indexer: Lucene and Solr (merged as Lucene/Solr in 2010)
• A project headed by Doug Cutting
• Goal: an open-source search engine scalable enough to index the entire web (~billions of pages)
• Nutch includes
  • Java crawler
  • HTML parser + Lucene search/index library + lots more

Features of Nutch
• Robot crawler, can use proxy
• Includes hosts via grep, exclusion by host names and suffixes
• Continuous indexing
• FTP indexing login option
• Index logging options
• Flexible query parsing
• Includes link-analysis module (mainly for multisite search)
• Includes approximately fifteen relevance quality adjustment options
• Caches original page for display

Workflow of Nutch
• There are two paths (index path & query path) through a search engine
• The index path shows how the index gets filled with documents
  • The documents are fed to an analyzer, which transforms them into the appropriate weighted terms (or scores) and passes them to the IndexWriter
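
The index path can be illustrated with a small sketch. This is not Nutch or Lucene code: `analyze`, `ToyIndexWriter`, and the tf-idf style weighting are hypothetical stand-ins chosen only to show documents flowing through an analyzer into an index.

```python
import math
import re
from collections import Counter, defaultdict


def analyze(text):
    """Toy analyzer: lowercase the text, split it into word tokens,
    and count how often each term occurs."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)


class ToyIndexWriter:
    """Hypothetical stand-in for an index writer.
    Stores postings as term -> {doc_id: term frequency}."""

    def __init__(self):
        self.postings = defaultdict(dict)
        self.doc_count = 0

    def add_document(self, doc_id, text):
        # The analyzer output (weighted terms) is what reaches the index.
        for term, tf in analyze(text).items():
            self.postings[term][doc_id] = tf
        self.doc_count += 1

    def weight(self, term, doc_id):
        """tf-idf style weight; an assumed scoring choice for illustration."""
        docs = self.postings.get(term, {})
        if doc_id not in docs:
            return 0.0
        idf = math.log(self.doc_count / len(docs)) + 1.0
        return docs[doc_id] * idf


writer = ToyIndexWriter()
writer.add_document("d1", "Nutch is a crawler. Nutch uses Lucene.")
writer.add_document("d2", "Lucene is an indexing library.")
```

Here a term that appears in fewer documents ("nutch") ends up with a larger weight than one that appears everywhere ("lucene").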

Connection Steps
• For security reasons, the ist516 server is only accessible from IST’s VLabs
  • First, log in to IST’s VLabs environment
  • Second, from VLabs, log in to the ist516 server

Connecting to VLabs
• From Windows/Mac remote desktop, log in to VLabs using your PSU ID/PWD
• Note “UPPSU-ID” for the user name below

Connecting to ist516.ist.psu.edu
• A UNIX server is prepared for proj #2
  • ist516.ist.psu.edu (130.203.136.10)
  • Can be accessed via the SSH protocol only
• If not pre-installed, get an SSH client from https://downloads.its.psu.edu/ ("File Transfer")

Connecting to ist516.ist.psu.edu
• If an SSH client is pre-installed in VLabs, use it
• In “Quick connect”, use the provided team ID/PWD

ist516.ist.psu.edu
• Tomcat (Apache’s web server and servlet container) and Nutch are already installed on the server
  • Under each team's home directory (e.g., /home/team-ID/nutch-1.0)
  • Modify the files under "nutch-1.0/conf" to change the behavior of Nutch as you wish

Running Tomcat and Nutch
• To start or stop the Tomcat server, type:
    start-tomcat
    stop-tomcat
• To run Nutch, at the command line, type:
    nutch
  or provide parameters:
    nutch [parameters]
• The server has most of the typical UNIX software installed, including:
  • wget: downloads files from a URL
  • nano: a small editor that Windows users may find useful and familiar
  • Emacs: a full-fledged, powerful UNIX editor

Crawling in Nutch
• There are two approaches to crawling:
  • Intranet crawling, with the crawl command
  • Whole-web crawling, with much greater control, using the lower-level inject, generate, fetch and updatedb commands
• Intranet crawling is more suitable for small-scale projects
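
How the four lower-level steps fit together can be sketched with a toy, in-memory simulation. Nothing here calls Nutch: `TOY_WEB`, the status strings, and the function bodies are hypothetical and only mirror the names of the inject, generate, fetch and updatedb commands.

```python
# Hypothetical in-memory "web": page URL -> list of outgoing links.
TOY_WEB = {
    "http://a.example/": ["http://a.example/1", "http://a.example/2"],
    "http://a.example/1": ["http://a.example/3"],
    "http://a.example/2": [],
    "http://a.example/3": [],
}


def inject(crawldb, seeds):
    """Add seed URLs to the crawl database as not-yet-fetched."""
    for url in seeds:
        crawldb.setdefault(url, "unfetched")


def generate(crawldb):
    """Produce a fetch list: every URL that has not been fetched yet."""
    return sorted(u for u, status in crawldb.items() if status == "unfetched")


def fetch(fetch_list):
    """'Download' each page; here that just means looking up its links."""
    return {url: TOY_WEB.get(url, []) for url in fetch_list}


def updatedb(crawldb, fetched):
    """Mark fetched pages and add newly discovered links as unfetched."""
    for url, links in fetched.items():
        crawldb[url] = "fetched"
        for link in links:
            crawldb.setdefault(link, "unfetched")


def crawl(seeds, depth):
    """Repeat generate/fetch/updatedb for `depth` rounds."""
    crawldb = {}
    inject(crawldb, seeds)
    for _ in range(depth):
        fetch_list = generate(crawldb)
        if not fetch_list:
            break
        updatedb(crawldb, fetch(fetch_list))
    return crawldb


db = crawl(["http://a.example/"], depth=2)
```

After two rounds the seed and its direct links are fetched, while the page two links away has been discovered but not yet fetched. A one-step crawl command roughly corresponds to running this whole loop for you.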

1. Intranet Crawling
• Create a text file, say urlfile.txt, containing some seed URLs, e.g., http://pike.psu.edu/
• Edit the file conf/crawl-urlfilter.txt and replace MY.DOMAIN.NAME with the name of the domain you wish to crawl
  • E.g., if you wish to limit the crawl to the pike.psu.edu domain, the line should read:
      +^http://([a-z0-9]*\.)*pike.psu.edu/
  • This will include any URL in the domain pike.psu.edu
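
To see which URLs such a filter line would let through, the pattern can be tried out in Python. This is a sketch under assumptions: it uses Python's `re` module rather than Nutch's own filter, it reads the pattern as `^http://([a-z0-9]*\.)*pike.psu.edu/` (the leading `+` only marks "include"), and the sample URLs are made up.

```python
import re

# Pattern from the filter line, without the leading "+" (which means "include").
# Note: the unescaped dots in "pike.psu.edu" match any character, as in the
# original line; that is looser than strictly necessary but harmless here.
ACCEPT = re.compile(r"^http://([a-z0-9]*\.)*pike.psu.edu/")


def accepted(url):
    """Return True if the URL matches the include pattern."""
    return ACCEPT.search(url) is not None


# Hypothetical sample URLs for illustration.
samples = [
    "http://pike.psu.edu/",
    "http://www.pike.psu.edu/classes/",
    "http://www.psu.edu/",
    "https://pike.psu.edu/",
]
results = {url: accepted(url) for url in samples}
```

Hosts inside pike.psu.edu (with or without a subdomain prefix) pass; other hosts and non-http schemes do not.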

1. Intranet Crawling
• Edit the file conf/nutch-site.xml accordingly
• At a minimum, insert the following property and fill in a proper value:
    <property>
      <name>http.agent.name</name>
      <value>YOUR-CRAWLER-NAME-HERE</value>
      <description></description>
    </property>
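
If you prefer to generate the property rather than type it, a small sketch with Python's standard `xml.etree.ElementTree` is shown here. The `<configuration>` root element and the crawler name `team1-crawler` are assumptions for illustration; check them against the nutch-site.xml already in your conf directory.

```python
import xml.etree.ElementTree as ET


def make_property(name, value, description=""):
    """Build one <property> element with name, value and description children."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value
    ET.SubElement(prop, "description").text = description
    return prop


# Assumed root element name; "team1-crawler" is a made-up agent name.
config = ET.Element("configuration")
config.append(make_property("http.agent.name", "team1-crawler"))

# Serialize to a string that could be pasted into the config file.
xml_text = ET.tostring(config, encoding="unicode")
```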

1. Intranet Crawling
• Use the crawl command for crawling. Its options include:
  • -dir: names the directory to put the crawl in
  • -depth: indicates the link depth from the root page that should be crawled
  • -delay: determines the number of seconds between accesses to each host
  • -threads: determines the number of threads that will fetch in parallel
• E.g., a typical call might be:
    > nutch crawl urlfile.txt -dir crawl.test -depth 3 >& log
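
When trying several depths or directories, it can help to assemble the call programmatically. The sketch below only builds the argument list and does not run anything; `build_crawl_command` is a hypothetical helper, and it uses only the options listed above. The `>& log` redirection is a shell feature and is left out.

```python
import shlex


def build_crawl_command(url_file, crawl_dir, depth, threads=None):
    """Return the argument list for a crawl call (hypothetical helper)."""
    argv = ["nutch", "crawl", url_file, "-dir", crawl_dir, "-depth", str(depth)]
    if threads is not None:
        argv += ["-threads", str(threads)]
    return argv


argv = build_crawl_command("urlfile.txt", "crawl.test", 3)

# Quote each argument so the line is safe to paste into a shell.
command_line = " ".join(shlex.quote(arg) for arg in argv)
```

The resulting `command_line` matches the typical call, minus the log redirection.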

1. Intranet Crawling
• The indexer uses the downloaded contents to generate an inverted index of all terms and all pages
• The document set is divided into a set of index segments, each of which is fed to a single searcher process
• Each searcher also draws upon the Web content fetched earlier, so it can provide a cached copy of any Web page
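
The segment/searcher arrangement can be pictured with a toy model. This is an illustrative sketch, not Nutch's implementation: `Searcher`, `split_into_segments`, `search_all`, and the round-robin split are all hypothetical.

```python
import re
from collections import defaultdict


class Searcher:
    """Toy searcher: owns one segment, its inverted index, and cached pages."""

    def __init__(self, segment):
        # segment: dict mapping url -> downloaded page text
        self.cache = dict(segment)
        self.index = defaultdict(set)  # term -> set of urls
        for url, text in segment.items():
            for term in set(re.findall(r"[a-z0-9]+", text.lower())):
                self.index[term].add(url)

    def search(self, term):
        return self.index.get(term.lower(), set())

    def cached(self, url):
        """Return the cached copy of a page, or None if not in this segment."""
        return self.cache.get(url)


def split_into_segments(pages, n):
    """Divide the document set into n segments (round-robin, by sorted URL)."""
    segments = [dict() for _ in range(n)]
    for i, (url, text) in enumerate(sorted(pages.items())):
        segments[i % n][url] = text
    return segments


def search_all(searchers, term):
    """Ask every searcher and merge the hits."""
    hits = set()
    for searcher in searchers:
        hits |= searcher.search(term)
    return sorted(hits)


# Made-up pages for illustration.
pages = {
    "http://x.example/a": "Nutch crawler",
    "http://x.example/b": "Lucene index",
    "http://x.example/c": "Nutch and Lucene",
}
searchers = [Searcher(seg) for seg in split_into_segments(pages, 2)]
nutch_hits = search_all(searchers, "nutch")
```

Each searcher answers only for its own segment, and a query is answered by merging the per-segment hits.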

2. Internet Crawling
• More steps are needed than for intranet crawling
• Explore it for your proj #2
• Refer to: http://wiki.apache.org/nutch/NutchTutorial

3. Searching
• Tomcat is installed, and each group has its own webapp directory, which holds the Nutch WAR file
• To search, put the Nutch WAR file into your servlet container:
    > cp ~/nutch-1.0/nutch*.war ~/tomcat/webapps/ROOT.war
• Go to the directory that your crawler created and run the Tomcat server:
    > cd crawl.test
    > start-tomcat

3. Searching
• Connect your browser to:
  • http://ist516.ist.psu.edu:900? (? is your group number)
  • E.g., Team 1: http://ist516.ist.psu.edu:9001/
• To access this URL, students need to log in to VLabs first and access it from there:
  • vlabs.up.ist.psu.edu + PSU ID/PWD
  • Refer to the VLabs Tutorial for more details:
    • http://pike.psu.edu/classes/ist516/2010fall/s/slides/vlabs-tutorial.ppt
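
The port rule can be written as a one-line helper. This is a sketch that assumes single-digit group numbers (the pattern 900? has only one placeholder digit); `search_url` is a hypothetical name.

```python
def search_url(group_number):
    """Build the search URL for a group, assuming a single-digit group number."""
    if not 1 <= group_number <= 9:
        raise ValueError("this sketch only covers group numbers 1 to 9")
    return f"http://ist516.ist.psu.edu:900{group_number}/"


team1_url = search_url(1)
```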


Editing Nutch Look
• To change the look & feel of the search interface:
  • search.html is automatically generated
  • Instead, change the XML files directly:
      ~/nutch-1.0/src/web/pages/en/search.xml
      ~/nutch-1.0/src/web/pages/en/about.xml
      ~/nutch-1.0/src/web/pages/en/help.xml
• For more details on how to edit the Nutch look, see:
  • http://www.stevekallestad.com/wiki/Editing_nutch

Reference
• Apache’s Official Nutch Tutorial
  • http://wiki.apache.org/nutch/NutchTutorial
• Peter Wang’s Nutch Tutorial
  • http://zillionics.com/resources/articles/NutchGuideForDummies.htm
• IST 441’s Nutch Tutorial
  • http://clgiles.ist.psu.edu/IST441/materials/nutch-lucene/nutch-crawling-and-searching.pdf