Searching: Google PageRank and anchor text, HITS

Searching
  • Google: PageRank and anchor text
  • HITS: hubs and authorities
  • MSN’s RankNet: learning to rank
  • Today’s web dragons

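How to search: Google’s PageRank
  • PageRank
  • Anchor text

Pages p1, p2 and p3 link to ~me:

$$\mathrm{rank}(\sim\!\mathrm{me}) = \sum_{p \in \{p_1, p_2, p_3\}} \frac{\mathrm{rank}(p)}{\#\mathrm{outlinks}(p)}$$

In general, with C(p, q) = 1 when p links to q and o(p) the number of outlinks of p:

$$r(q) = \sum_{p} C(p, q)\, \frac{r(p)}{o(p)}$$

In matrix form, with the 1/o(p) factor and the transpose folded into C:

$$r = C\, r$$

  • r is an eigenvector of C
  • Random surfer model
  • Broken links (hence )
  • Trapping states (adjust C)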
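The rank equation on this slide can be iterated until it settles, which is one way to find the eigenvector r. The sketch below is a minimal illustration, not Google's implementation: the function name `pagerank`, the damping value 0.85 (standing in for the "adjust C" fix for trapping states), the even redistribution of rank from pages without outlinks, and the toy graph are all assumptions.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power iteration for r(q) = sum over p of C(p, q) * r(p) / o(p).

    links maps each page to the list of pages it links to.
    damping and the dangling-page handling are assumptions of this sketch.
    """
    pages = sorted(set(links) | {q for outs in links.values() for q in outs})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Random surfer: with probability (1 - damping) jump to a random page.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            outs = links.get(p, [])
            if outs:
                # Each outlink of p receives an equal share of p's rank.
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new_rank[q] += share
            else:
                # Page with no outlinks: spread its rank over all pages.
                share = damping * rank[p] / n
                for q in pages:
                    new_rank[q] += share
        rank = new_rank
    return rank


# Toy web: p1, p2 and p3 all link to "me" (hypothetical data).
web = {"p1": ["me"], "p2": ["me", "p1"], "p3": ["me"], "me": ["p2", "p3"]}
ranks = pagerank(web)
```

Here `ranks` holds one score per page, and the scores sum to 1 because every step only redistributes rank.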

Chart of the web
  • Random surfer vs random searcher

  Region                 Share
  New archipelago        20% of nodes
  Milgram’s continent    30% of nodes
  Corporate continent    20% of nodes
  Terra incognita        30% of nodes

Google search: anchor text
  • PageRank
  • Anchor text

  ~me: this is the best page ever
  you: that is the best page ever
  ~me: …

Google uses:
  • In anchor text?
  • In URL?
  • Title
  • Meta tags
  • <h> level
  • Rel font size
  • Capitalization
  • Word pos in doc
  • Secret ingredients

… and weights them according to a secret recipe
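The anchor-text idea on this slide is that the words other pages use when linking to a page get indexed against the target page, not only the source. A minimal sketch of that indexing step follows; the function name `build_anchor_index`, the tuple layout, and the sample links are hypothetical, and real engines combine this signal with the others listed above in ways the slide calls secret.

```python
from collections import defaultdict


def build_anchor_index(links):
    """Index each target page under the words of the anchor text pointing at it.

    links is an iterable of (source_page, target_page, anchor_text) tuples.
    """
    index = defaultdict(set)
    for _source, target, anchor_text in links:
        for word in anchor_text.lower().split():
            index[word].add(target)
    return index


# Hypothetical links: "you" and "p2" both describe ~me in their anchor text.
links = [
    ("you", "~me", "the best page ever"),
    ("p2", "~me", "best page"),
    ("p3", "other", "another page"),
]
index = build_anchor_index(links)
```

Looking up `index["best"]` then returns the pages that others describe as "best", whether or not those pages use the word themselves.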

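HITS: hubs and authorities

$$\mathrm{auth}(p) = \sum_{x} C(x, p)\, \mathrm{hub}(x)$$

$$\mathrm{hub}(x) = \sum_{p} C(x, p)\, \mathrm{auth}(p)$$

$$\mathrm{hub} = C\, \mathrm{auth}, \qquad \mathrm{auth} = C^{T}\, \mathrm{hub}, \qquad \mathrm{hub} = C\, C^{T}\, \mathrm{hub}$$

  • hub is an eigenvector of C C^T
  • Principal eigenvector: strongest community
  • Other eigenvectors: other communities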
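The two HITS updates can be alternated, with normalization, until the scores converge toward the principal eigenvector. The following is a minimal sketch under assumptions: the function name `hits`, the fixed iteration count, the Euclidean normalization, and the toy edge list are illustrative choices, not details taken from the slides.

```python
import math


def hits(edges, iterations=50):
    """Alternate auth = C^T hub and hub = C auth on a list of (x, p) links."""
    nodes = sorted({n for edge in edges for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority of p: sum of hub scores of the pages x that link to p.
        auth = {n: 0.0 for n in nodes}
        for x, p in edges:
            auth[p] += hub[x]
        # Hub score of x: sum of authority scores of the pages p it links to.
        new_hub = {n: 0.0 for n in nodes}
        for x, p in edges:
            new_hub[x] += auth[p]
        hub = new_hub
        # Normalize both vectors so the scores stay bounded.
        for scores in (hub, auth):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth


# Hypothetical graph: h1, h2, h3 are resource lists; a1, a2 are target pages.
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h2", "a2"), ("h3", "a2")]
hub, auth = hits(edges)
```

In this toy graph a2 is linked from every hub, so it ends up with the highest authority score, while h1 and h2 tie as the best hubs.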

Using HITS: Ask’s Teoma

Query: jaguar

Web communities
  • jaguar <car>
  • jaguar <animal>
  • jaguar <Mac OS>
  • jaguar <auto racing team>
  • jaguar <Jacksonville Jaguars>

Using HITS: Ask’s Teoma

Web communities
  • jaguar <car>
  • jaguar <animal>
  • jaguar <Mac OS>
  • jaguar <auto racing team>
  • jaguar <Jacksonville Jaguars>

Query neighborhood graph (search hits + neighbors)
  • Hub scores (lists of resources)
  • Authority scores (target pages)
  • helps to deal with synonyms
  • pull in other relevant pages (e.g. Toyota is authority for “auto manufacturers” yet doesn’t contain the term)
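The query neighborhood graph mentioned on this slide can be pictured as the search hits plus every page one link away from them. The sketch below is an assumption-laden illustration: the function name `neighborhood_graph`, the one-hop expansion rule, and the page names are hypothetical, and it says nothing about how Teoma actually builds its graph.

```python
def neighborhood_graph(search_hits, links):
    """Return the pages and links in the one-hop neighborhood of the search hits.

    search_hits is a set of pages that match the query text.
    links is a list of (source, target) pairs for the whole web sample.
    """
    nodes = set(search_hits)
    for source, target in links:
        # Pull in any page that links to, or is linked from, a search hit.
        if source in search_hits or target in search_hits:
            nodes.add(source)
            nodes.add(target)
    # Keep only the links whose two ends are both inside the neighborhood.
    edges = [(s, t) for s, t in links if s in nodes and t in nodes]
    return nodes, edges


# Hypothetical data: only "carguide" contains the query term.
search_hits = {"carguide"}
links = [
    ("carguide", "toyota"),
    ("carguide", "honda"),
    ("blog", "carguide"),
    ("unrelated", "elsewhere"),
]
nodes, edges = neighborhood_graph(search_hits, links)
```

Here "toyota" enters the graph through its link from "carguide" even though it never matched the query, which is the effect the Toyota example describes; hub and authority scores would then be computed on `edges`.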

Learning to rank: MSN’s RankNet
  • Training set: queries with matching documents from human judges
      - 17,000 queries
      - 10 documents/query
      - human judgement (1–5)
      - 600 features
  • Discriminant function: e.g. weighted sum of features, plus threshold
  • Machine learning: learn the weights
      - pairs of docs with same query: which is more highly ranked?
      - train a neural net (1-layer, 2-layer)
  • Apply to real queries
  • Results? Pretty good
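The pairwise training idea can be sketched with a plain weighted sum of features as the scoring function, which corresponds loosely to the 1-layer case. This is a minimal illustration, not MSN's system: the function names, the logistic loss on the score difference, the learning rate, the epoch count, and the two-feature toy data are all assumptions.

```python
import math


def score(weights, features):
    """Discriminant function: weighted sum of features."""
    return sum(w * f for w, f in zip(weights, features))


def train_pairwise(pairs, n_features, lr=0.1, epochs=200):
    """Learn weights from (better, worse) feature pairs for the same query."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for better, worse in pairs:
            diff = score(weights, better) - score(weights, worse)
            # Loss is log(1 + exp(-diff)); its derivative w.r.t. diff is below.
            grad = -1.0 / (1.0 + math.exp(diff))
            for i in range(n_features):
                # Gradient step: raise the score gap between better and worse.
                weights[i] -= lr * grad * (better[i] - worse[i])
    return weights


# Hypothetical judged pairs: the first document was rated above the second.
pairs = [
    ([1.0, 0.2], [0.3, 0.1]),
    ([0.8, 0.9], [0.4, 0.5]),
    ([0.9, 0.1], [0.2, 0.6]),
]
weights = train_pairwise(pairs, n_features=2)
```

After training, sorting a query's documents by `score(weights, features)` reproduces the judged order on these pairs; a 2-layer network would replace `score` with a nonlinear function trained on the same pairwise loss.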

Today’s web dragons

  Share   Dragon         Notes
  49%     Google         1998, 2004; Sergey Brin, Larry Page
  23%     Yahoo          1994, 1996; Inktomi 2002; AltaVista 2003
  10%     MSN            2005
  7%      AOL            Excite since 1997, Google since 2002
  2%      Ask (Jeeves)   Teoma 2001