CS 533 Information Retrieval Dr. Michal Cutler Lecture #16 March 30, 2000

Webor
- Web-based search tool for Organization Retrieval
- Based on the vector space model

Webor
- Consists of two components
  - Indexing engine
  - Search engine

Language
- The indexing engine is a C++ program
- The search engine is a Common Gateway Interface (CGI) program written in C++

Indexing engine
- Robot-based
- Builds 25 inverted index files
- Builds a webpage index with:
  - doc id, title, URL,
  - number of unique terms,
  - number of “parents”,
  - parent doc ids

Search engine
- Reads the inverted list for each indexed query term and computes similarity
- Returns a list of web pages
- Ranks web pages based on cosine similarity

Webor
- How to make use of the structure information in webpage collections to improve retrieval

HTML tags
- In the current implementation many tags are ignored. Webor checks only whether a term:
  - appears in the title, or
  - appears in a header, or
  - is emphasized (underlined, italics or bold), or
  - is inside a list item
- All other appearances are “plain”

Main Idea
- Intuitively, terms in the title, headers, emphasized text, or lists are important
- Storing occurrence information in the index and assigning importance values to these appearances may improve retrieval

Anchor descriptions
- In an anchor tag, authors include a description of the target document in addition to its URL
- These descriptions reflect the authors' perception of the document's contents

Anchor descriptions
- May provide good synonyms and related terms
- Make it possible to retrieve documents that could not be retrieved otherwise
- Using an optimal importance value for anchor terms may improve the ranking

The six Classes
- Plain Text,
- Title,
- H1-H2,
- H3-H6,
- Strong, and
- Anchor.
- The new version uses a different set of 6 classes

The anchor class
- Includes terms in the anchor tags of hyperlinks to the document
- This means that the document is augmented by descriptions that occur in other documents

The tags and the classes (V1)

Modified set of tags (V2)

Indexing engine
  Read a list of URL seeds and store in HP index
  Read a list of seed domains
  while unread URLs do
    Fetch a home page, save title and doc ID in HP index
    while more “child links” in HP do
      Extract anchor terms for “child” HP
      Extract URL and, if new, store in HP index
      Add “parent” to “child” HP in the HP index
    Build keyword index
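The crawl loop above can be sketched roughly as follows. This is only an illustration under stated assumptions: `fetch` is a hypothetical callback returning a page title and its outgoing links with anchor text, the record field names are invented for the example, and the seed-domain restriction and keyword indexing steps are left out.

```python
from collections import deque

def crawl(seeds, fetch):
    """Breadth-first crawl that fills a home page (HP) index.

    fetch(url) is a hypothetical callback returning
    (title, [(child_url, anchor_text), ...]).
    """
    hp = {}          # url -> record; field names are illustrative only
    queue = deque()  # URLs stored in the index but not yet fetched

    def register(url):
        hp[url] = {"id": len(hp), "title": None,
                   "parents": [], "anchor_terms": []}
        queue.append(url)

    for url in seeds:
        register(url)

    while queue:
        url = queue.popleft()
        title, links = fetch(url)
        hp[url]["title"] = title
        for child, anchor_text in links:
            if child not in hp:           # new URL: store in HP index
                register(child)
            # The child page is augmented with its parent's description.
            hp[child]["anchor_terms"].extend(anchor_text.split())
            hp[child]["parents"].append(hp[url]["id"])
    return hp
```

Passing an in-memory dictionary lookup as `fetch` is enough to try the loop without touching the network.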

Build keyword index
  Extract strings between HTML tags and assign to one of 6 classes
  while not empty string do
    Extract a token
    Transform token to a keyword and update index file

Class assignment (V1)
- Assign keywords within <TITLE> and </TITLE> to the Title class
- Then, assign keywords within <H1> and </H1> or <H2> and </H2> to the H1-H2 class
- Then assign to H3-H6, and then to the Strong class
- Remaining terms: Plain class
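The priority order above can be sketched as a small function. The set of tags treated as "Strong" is an assumption here (the slides only say underlined, italic or bold), and the Anchor class is not handled because it comes from other pages' links rather than from the page's own tags.

```python
# Assumption: which tags count as emphasis is not listed on this slide.
STRONG_TAGS = {"strong", "b", "em", "i", "u"}

def assign_class(enclosing_tags):
    """Return the class of a keyword given the tags that enclose it.

    Classes are checked in the V1 priority order:
    Title, then H1-H2, then H3-H6, then Strong, else Plain.
    """
    tags = {t.lower() for t in enclosing_tags}
    if "title" in tags:
        return "Title"
    if tags & {"h1", "h2"}:
        return "H1-H2"
    if tags & {"h3", "h4", "h5", "h6"}:
        return "H3-H6"
    if tags & STRONG_TAGS:
        return "Strong"
    return "Plain"
```

For example, a bold word inside an `<H2>` header would be counted once, in the H1-H2 class, because headers are checked before emphasis.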

A token in Webor
- Sequence of non-blank characters
- Truncate “’s”
- Discard one-character tokens

What is a token?
- If the last character is not alphabetic, discard that character
  - “numbers,” to “numbers”
- Discard tokens with nonalphabetic characters
- Discard a token if a capital letter occurs in the middle of a noncapitalized word
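A minimal sketch of these token rules follows. The order in which the rules are applied is an assumption, as is the reading of "noncapitalized" as "not written entirely in capitals"; stemming, lower-casing and stop-word removal are not included.

```python
def tokens(text):
    """Split text on whitespace and apply the token rules in one plausible order."""
    kept = []
    for tok in text.split():
        # Drop a single trailing non-alphabetic character: "numbers," -> "numbers".
        if tok and not tok[-1].isalpha():
            tok = tok[:-1]
        # Truncate a possessive ending (straight or curly apostrophe).
        if tok.endswith("'s") or tok.endswith("\u2019s"):
            tok = tok[:-2]
        # Discard one-character (or empty) tokens.
        if len(tok) <= 1:
            continue
        # Discard tokens that still contain non-alphabetic characters.
        if not tok.isalpha():
            continue
        # Discard mixed-case tokens with a capital after the first letter,
        # unless the token is entirely upper case (assumed interpretation).
        if not tok.isupper() and any(c.isupper() for c in tok[1:]):
            continue
        kept.append(tok)
    return kept
```

With these rules, `tokens("numbers, Webor's EngiNet OS/2 a IBM retrieval")` keeps `numbers`, `Webor`, `IBM` and `retrieval`, and drops `EngiNet`, `OS/2` and `a`.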

Keywords in Webor
- Stem the token with Porter’s stemmer
- Make the token lower case
- Add to the index if it is not a stop word

The keyword index (V1)
- Binary search trees
- Reside in main memory
- At the end, copied to disk
- Only about 1000 web pages can be indexed
- Eventually, all index files and HP files must be merged

Calculating Cosine similarity
- The idf of a term is only known at the end of the merge
- To compute the “length” of the document vector, the weight tf*idf of each document term must be used
- Now the “length” formula can be computed and saved in the HP index
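The dependency described above (the document length needs every term's idf, which needs the final df) can be illustrated with a short sketch. The idf formula `log(N / df)` is a common textbook choice and is an assumption here; the slides do not say which variant Webor uses.

```python
import math

def idf(num_docs, df):
    """Assumed idf variant: natural log of (collection size / document frequency)."""
    return math.log(num_docs / df)

def doc_length(term_freqs, dfs, num_docs):
    """Euclidean length of a document's tf*idf vector.

    term_freqs maps term -> tf in this document;
    dfs maps term -> df over the fully merged collection.
    """
    return math.sqrt(sum((tf * idf(num_docs, dfs[term])) ** 2
                         for term, tf in term_freqs.items()))
```

Because `dfs` must reflect the whole collection, this value can only be computed once all partial indexes have been merged, which matches the ordering on the slide.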

Limitations
- Cannot search for phrases
- Numbers are discarded
- EngiNet, OS/2 are discarded
- All terms are lower case

Problems with indexing engine in V1
- Efficiency (V1)
- Names (unless all capitals) are stemmed (V1)
- Hyphens (V1)
- Typos
- URLs (V1)

Efficiency
- Binary search trees can be skewed
- Writing 25 files can be slow because of seek times
- V2 uses B-trees. The B-tree is partially stored in main memory

Typing errors
- “engineeringwith” is an index term
- “engineer-ing” becomes “engineer”
- “engeneer” is stemmed to “engen”

URLs
- Some duplicate home pages have non-identical URLs (fixed in V2)
- Pages on servers that are down are not indexed (fixed in V2)

The files
- HP (homepage) index
  - a record for each webpage
- 25 keyword index files
  - A, B, …, Y
- Each file
  - a sequence of keyword records

The web page records
- URL address, webpage id,
- webpage title,
- number of unique keywords,
- anchor terms used by “parents”,
- total number of “parents”,
- list of “parent” IDs

Keyword record
- The keyword (and ID)
- df: number of documents with the keyword
- An inverted list

The inverted list
- A sequence of (ID, TFV) pairs, where
- TFV is a vector of size 6 that contains the term frequencies for the 6 classes
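As an illustration of this record layout, here is a small in-memory sketch. The class and field names are hypothetical, the class order follows the V1 list of six classes, and nothing here reflects the on-disk format of the actual index files.

```python
from dataclasses import dataclass, field

# Assumed position of each class inside a TFV (order of the V1 class list).
CLASSES = ("plain", "title", "h1-h2", "h3-h6", "strong", "anchor")

@dataclass
class KeywordRecord:
    keyword: str
    # Inverted list: doc ID -> TFV, one frequency counter per class.
    postings: dict = field(default_factory=dict)

    @property
    def df(self):
        """Number of documents that contain the keyword."""
        return len(self.postings)

    def add(self, doc_id, cls):
        """Record one occurrence of the keyword in doc_id under class cls."""
        tfv = self.postings.setdefault(doc_id, [0] * len(CLASSES))
        tfv[CLASSES.index(cls)] += 1
```

Keeping one counter per class, instead of a single tf, is what lets the search engine try different importance values without re-indexing.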

The search engine
  Read the HP index file
  Read query string
  while string not empty do
    Get token
    Convert to keyword
    If in index, get keyword record
  Compute similarity
  Display in nonincreasing similarity

The weight of a term
- Webor uses the 6 Class Importance Values (CIV) computed by our experiments
- The weight of a term is computed by

Normal
- In this case results are identical to ignoring HTML tags and the parent anchors
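The slides as transcribed here do not include the weight formula itself. One reading that is consistent with the surrounding slides, offered purely as an assumption, is that the per-class frequencies in the TFV are combined with the CIV by a dot product and then scaled by idf. Under that reading, a "normal" CIV that weights every on-page class equally and gives the anchor class zero reduces to ordinary tf*idf. The class order and the example CIV values below are hypothetical.

```python
# Assumed class order: plain, title, h1-h2, h3-h6, strong, anchor.
NORMAL_CIV = (1, 1, 1, 1, 1, 0)  # hypothetical "normal" setting

def term_weight(tfv, civ, idf):
    """Assumed weighting: (CIV . TFV) * idf. Not confirmed by the slides."""
    return sum(c * f for c, f in zip(civ, tfv)) * idf

tfv = (3, 1, 0, 0, 2, 4)                  # made-up frequencies for one term
plain_tf = sum(tfv[:5])                   # occurrences on the page itself
assert term_weight(tfv, NORMAL_CIV, 2.0) == plain_tf * 2.0
```

A CIV with larger values for some classes would raise the weight of terms that occur in those classes, which is the effect the experiments set out to tune.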

Creating a test bed
- Web pages: a snapshot of the Binghamton University site in Dec. 1996 (about 4,600 pages; after removing duplicates, about 3,000 pages).
- Queries: 20 queries were created.
- For each query, the documents relevant to the query were identified manually.

The 20 queries
web-based retrieval concert and music neural network intramural sports master thesis in geology cognitive science prerequisite of algorithm campus dining handicap student help career development promotion guideline non-matriculated admissions grievance committee student associations laboratory in electrical engineering research centers anthropology chairman engineering program computer workshop papers in philosophy computer and cognitive system

A Genetic Algorithm for finding the optimal CIV
- The initial population has 30 CIVs.
  - 25 are randomly generated (range [1, 15])
  - 5 are “good” CIVs from manual screening.
- Each new generation of CIVs is produced by executing crossover, mutation, and reproduction.

The Genetic Algorithm
- Crossover
  - done for each consecutive pair of CIVs, with probability 0.75
  - a single random cut for each selected pair
  - Example (cut after the second component):
      old pair: (1, 4, 2, 1) (2, 3, 1, 2, 5, 1)
      new pair: (2, 3, 2, 1) (1, 4, 1, 2, 5, 1)
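Single-cut crossover as described can be sketched as below. The pairing of positions (0,1), (2,3), and so on is an assumed reading of "consecutive pair", and the six-component CIVs in the usage note are made-up values, not the slide's example.

```python
import random

def crossover_pair(a, b, cut):
    """Swap the tails of two CIVs after position `cut`."""
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def crossover(population, p=0.75, rng=random):
    """Apply single-cut crossover to consecutive pairs with probability p."""
    out = list(population)
    for i in range(0, len(out) - 1, 2):
        if rng.random() < p:
            # Cut strictly inside the vector so both parts are non-empty.
            cut = rng.randint(1, len(out[i]) - 1)
            out[i], out[i + 1] = crossover_pair(out[i], out[i + 1], cut)
    return out
```

For instance, `crossover_pair((1, 4, 2, 1, 3, 3), (2, 3, 1, 2, 5, 1), 2)` yields `(1, 4, 1, 2, 5, 1)` and `(2, 3, 2, 1, 3, 3)`: the heads stay in place and the tails are exchanged.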

The Genetic Algorithm
- Mutation
  - performed on each CIV with probability 0.1
  - When mutation is performed, each CIV component is either decreased or increased by one with equal probability, subject to the range conditions of each component. Example: if a component is already 15, then it cannot be increased.
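A rough sketch of this mutation step is shown below. Two points are assumptions: the bounds 1 and 15 are taken from the range used for the random initial population, and a step that would leave the range is simply dropped (the component stays unchanged), since the slide does not say what happens in that case.

```python
import random

def mutate(civ, lo=1, hi=15, rng=random):
    """Move every component by +1 or -1 with equal probability, clamped to [lo, hi]."""
    mutated = []
    for value in civ:
        step = rng.choice((-1, 1))
        # Assumption: an out-of-range step leaves the component unchanged.
        mutated.append(min(hi, max(lo, value + step)))
    return tuple(mutated)

def mutate_population(population, p=0.1, rng=random):
    """Mutate each CIV independently with probability p."""
    return [mutate(civ, rng=rng) if rng.random() < p else civ
            for civ in population]
```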

The Genetic Algorithm
- The fitness function
  - A CIV has an initial fitness of
    - 0 when the 11-point average precision is less than 0.22
    - (11-point average precision - 0.22), otherwise
  - The final fitness is its initial fitness divided by the sum of the initial fitnesses of all the CIVs in the current generation
    - each fitness is between 0 and 1
    - the sum of all fitness values is 1
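The two-stage fitness computation can be written out as a short sketch. The handling of a generation in which no CIV exceeds the threshold is an assumption, since the slide does not cover that case.

```python
def fitness(avg_precisions, threshold=0.22):
    """Normalized fitness for one generation.

    avg_precisions holds the 11-point average precision of each CIV.
    """
    # Initial fitness: the amount by which precision exceeds the threshold.
    initial = [max(0.0, ap - threshold) for ap in avg_precisions]
    total = sum(initial)
    if total == 0:
        # Assumption: if nothing beats the threshold, all fitnesses are 0.
        return [0.0] * len(initial)
    # Final fitness: share of the generation's total initial fitness.
    return [f / total for f in initial]
```

With precisions 0.20, 0.24 and 0.26, the initial values are 0, 0.02 and 0.04, so the final values are 0, 1/3 and 2/3 (up to floating-point rounding).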

The Genetic Algorithm
- Reproduction
  - Wheel of fortune scheme to select the parent population.
  - The scheme selects fit CIVs with high probability and unfit CIVs with low probability.
  - The same CIV may be selected more than once.
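Wheel-of-fortune (fitness-proportional) selection can be sketched with the standard library's weighted sampling. Keeping the parent population the same size as the current one is an assumption; the function name is illustrative.

```python
import random

def select_parents(population, fitnesses, rng=random):
    """Draw len(population) parents, each with probability proportional to its fitness.

    Sampling is with replacement, so a fit CIV can be picked several times.
    At least one fitness value must be positive.
    """
    return rng.choices(population, weights=fitnesses, k=len(population))
```

A CIV with fitness 0 is never selected, while one holding half of the total fitness is expected to fill about half of the parent slots.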

The Genetic Algorithm
- Termination
  - The algorithm terminates after 25 generations and the best CIV obtained is reported as the optimal CIV.
  - The 11-point average precision achieved by the optimal CIV is reported as the performance of the CIV.
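For reference, the usual textbook definition of 11-point average precision averages the interpolated precision at recall levels 0.0, 0.1, ..., 1.0. The sketch below follows that definition for a single query; whether the experiments used exactly this interpolation is an assumption.

```python
def eleven_point_avg_precision(ranked, relevant):
    """Mean interpolated precision at recall 0.0, 0.1, ..., 1.0 for one query."""
    relevant = set(relevant)
    points = []  # (recall, precision) measured at each relevant document found
    hits = 0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    total = 0.0
    for i in range(11):
        level = i / 10
        # Interpolated precision: best precision at any recall >= level.
        total += max((p for r, p in points if r >= level - 1e-12), default=0.0)
    return total / 11
```

For a ranking `d1, d2, d3, d4` with `d1` and `d3` relevant, precision is 1.0 up to recall 0.5 and 2/3 beyond it, giving (6 * 1.0 + 5 * 2/3) / 11.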

Experimental Results
Classes: title, header, list, strong, anchor, plain

Queries   Opt. CIV   Normal   New     Improvement
1st 10    281881     0.182    0.254   39.6%
2nd 10    271881     0.172    0.255   48.3%
all       251881     0.177    0.254   43.5%

Conclusions
- Anchor and strong are most important
- Header is also important
- Title is only slightly more important than list and plain

Future work
- Webor has the potential to substantially improve retrieval effectiveness.
- The results are still preliminary. Need to
  - Expand the set of queries in the test bed
  - Use other Web page collections