HITS: Hypertext Induced Topic Selection
Gyozo Gidofalvi, Uppsala Database Laboratory

Idea
§ Given a set of web pages
  • that are all concerned with the same topic
→ we want to find
  • the most interesting pages
  • by examining the internal link structure in the set
→ we want to find
  • the pages that are most likely to guide us to interesting pages
2020-12-07 Gyozo Gidofalvi

Foundation
§ Identify Hubs and Authorities
§ Definition is mutually recursive:
  • A good hub points to good authorities
  • A good authority is pointed to by good hubs
§ The hub value of a site is
  • the sum of the authority values of the sites that the site points to.
§ The authority value of a site is
  • the sum of the hub values of the sites that point to the site.
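The two definitions above can be written compactly. The notation below is an assumption introduced for illustration and does not come from the slides: $E$ is the set of links, where $(q, p) \in E$ means that page $q$ points to page $p$; $h(p)$ and $a(p)$ are the hub and authority values of page $p$; and $A$ is the adjacency matrix with $A_{qp} = 1$ exactly when $(q, p) \in E$.

```latex
\[
  a(p) = \sum_{q \,:\, (q,p) \in E} h(q),
  \qquad
  h(p) = \sum_{q \,:\, (p,q) \in E} a(q)
\]
% Equivalent matrix form, with h and a as column vectors:
\[
  a = A^{\mathsf{T}} h,
  \qquad
  h = A\, a
\]
```

Because each vector is defined in terms of the other, the values are typically computed by repeating the two updates and rescaling the vectors between rounds.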

Pseudo-code
1. Find a set of pages about a given subject
   § You may use an existing search engine (such as Google)
   § In the assignment, you are provided a bunch of pages with links
2. Preprocess the link structure
3. Initialize hub and authority vectors
4. Normalize the vectors to length 1
5. Calculate the new authority vector based on the link structure and the hub vector
6. Calculate the new hub vector based on the link structure and the authority vector
7. If the new values of the hub and authority vectors are similar enough to the old ones we are done, otherwise repeat from 4
8. Sort the vectors and find the top authorities and hubs
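Steps 3 to 8 of the pseudo-code could be sketched in Python roughly as follows. This is a minimal sketch and not the assignment's reference solution. The function names `normalize`, `hits` and `top`, the dict-of-sets input format, the uniform starting values, the tolerance `tol`, and the choice to feed the freshly computed authority vector into the hub update are all assumptions made for illustration.

```python
import math


def normalize(scores):
    """Scale a score dict to Euclidean length 1 (unchanged if all zero)."""
    length = math.sqrt(sum(v * v for v in scores.values()))
    if length == 0.0:
        return dict(scores)
    return {page: v / length for page, v in scores.items()}


def hits(links, tol=1e-9, max_iter=1000):
    """links maps each page to the collection of pages it points to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    if not pages:
        return {}, {}
    # Steps 3-4: uniform start, normalized to length 1 (an assumed choice).
    hub = normalize({p: 1.0 for p in pages})
    auth = dict(hub)
    for _ in range(max_iter):
        # Step 5: authority = sum of hub values of the pages pointing to it.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        new_auth = normalize(new_auth)
        # Step 6: hub = sum of authority values of the pages it points to.
        new_hub = normalize(
            {p: sum(new_auth[q] for q in links.get(p, ())) for p in pages}
        )
        # Step 7: stop when neither vector moved by more than tol.
        change = max(
            max(abs(new_auth[p] - auth[p]) for p in pages),
            max(abs(new_hub[p] - hub[p]) for p in pages),
        )
        hub, auth = new_hub, new_auth
        if change < tol:
            break
    return hub, auth


def top(scores, k):
    """Step 8: the k highest-scoring pages, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

On a hypothetical three-page graph (not the test case from the slides) such as `hits({"x": {"z"}, "y": {"z"}, "z": set()})`, page `z` should come out as the only authority, and `x` and `y` as equally good hubs.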

Calculating the hub and authority vectors
§ First we initialize the hub and authority vectors to some value.
  • What initial values are appropriate?
  • Does it matter what we initialize to?
§ Next, we calculate the new hub and authority vectors using the update formulas that follow from the hub and authority definitions
§ Does it matter in which order these calculations happen?
§ Do we need to normalize the vectors in each iteration?
§ How do we know when to stop?

Preprocessing
§ Preprocessing will improve the accuracy
§ Several links may point to the same page:
  • http://www.it.uu.se/index.html
  • www.it.uu.se
§ Remove site-internal links, as these can make a site seem more important than it really is.
§ Remove links to sites for which we do not know the link structure.
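The three preprocessing rules above might be sketched as follows. This is only an illustration under stated assumptions: the helper names `canonical`, `site` and `preprocess` are hypothetical, the canonicalization rules (default to `http`, lowercase the host, drop a trailing `index.html` and trailing slashes) are one plausible choice, and a "site" is taken to be the host name. The directions on the lab course web page take precedence over anything shown here.

```python
from urllib.parse import urlsplit


def canonical(url):
    """Map URL variants of the same page to one key (assumed rules)."""
    if "://" not in url:
        url = "http://" + url  # assume http when the scheme is missing
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    path = parts.path
    if path.endswith("/index.html"):
        path = path[: -len("index.html")]
    return host + path.rstrip("/")


def site(url):
    """The site of a URL, taken here to be its host name (an assumption)."""
    return canonical(url).split("/", 1)[0]


def preprocess(raw_links, crawled):
    """raw_links: (source, target) URL pairs; crawled: URLs with known links."""
    known_sites = {site(u) for u in crawled}
    edges = set()
    for src, dst in raw_links:
        if site(dst) not in known_sites:
            continue  # link structure of the target site is unknown
        if site(src) == site(dst):
            continue  # site-internal link
        edges.add((canonical(src), canonical(dst)))  # duplicates collapse
    return edges
```

With these rules, `canonical("http://www.it.uu.se/index.html")` and `canonical("www.it.uu.se")` both give `"www.it.uu.se"`, so links to either form count as links to the same page.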

The assignment
§ You will mine four different link structures for four different queries.
  • We have done the web crawling and some of the preprocessing for you!
§ Input files are on the lab course web page
§ However, you must
  • Do some preprocessing yourselves
    § Directions for pre-processing are on the lab course web page
  • Validate your implementation
    § Think of how to verify your solution
    § Your validation does not have to be fancy
      • not even automated
    § At least, implement the test case on the following slide, and see what output it gives you.
      • Make sure that the test case output is reasonable

Example (test case)
§ Rank the pages according to hub and authority value in this link structure:
[Figure: link structure over the pages a, b, c, d]