Knowledge Management with Documents
Qiang Yang, HKUST
Thanks: Professor Dik Lee, HKUST

Keyword Extraction
- Goal: given N documents, each consisting of words, extract the most significant subset of words (the keywords)
- Example: [All the students are taking exams] --> [student, take, exam]
- Keyword extraction process:
  - remove stop words
  - stem remaining terms
  - collapse terms using a thesaurus
  - build inverted index
  - extract keywords and build a keyword index
  - extract key phrases and build a key phrase index

Stop Words and Stemming
- From a given stop word list
  - [a, about, again, are, the, to, of, …]
  - Remove them from the documents
- Or, determine stop words
  - Given a large enough corpus of common English
  - Sort the list of words in decreasing order of their occurrence frequency in the corpus
  - Zipf's law: frequency × rank ≈ constant
    - most frequent words tend to be short
    - most frequent 20% of words account for 60% of usage

Zipf's Law -- An illustration

Resolving Power of Word
(Figure labels: non-significant high-frequency terms; non-significant low-frequency terms; presumed resolving power of significant words. Horizontal axis: words in decreasing frequency order.)

Stemming
- The next task is stemming: transforming words to their root form
  - Computing, Computer, Computation --> comput
- Suffix-based methods
  - Remove "ability" from "computability"
  - "…"+ness, "…"+ive --> remove
- Suffix list + context rules
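The stop-word and suffix-stripping steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the suffix list, the minimum stem length of 3, and the extra stop word "all" are hypothetical choices, not part of the slides, and no context rules are applied.

```python
# Stop words from the slide's example list, plus "all" (an assumption added
# so that the slide's example sentence works).
STOP_WORDS = {"a", "about", "again", "all", "are", "the", "to", "of"}

# Hypothetical suffix list, longest suffixes first. A real stemmer would
# combine a larger list with context rules.
SUFFIXES = ["ability", "ation", "ness", "ing", "ive", "er", "s"]


def stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def extract_keywords(text):
    """Lowercase, drop punctuation and stop words, then stem what is left."""
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    return [stem(w) for w in words if w and w not in STOP_WORDS]


keywords = extract_keywords("All the students are taking exams")
```

With this crude suffix list, "computing", "computer", "computation" and "computability" all reduce to "comput", while "taking" becomes "tak" rather than the dictionary form "take" shown on the Keyword Extraction slide; mapping to real root forms needs the context rules mentioned above.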

Thesaurus Rules
- A thesaurus aims at
  - classification of words in a language
  - for a word, it gives related terms that are broader than, narrower than, the same as (synonyms), and opposed to (antonyms) the given word (other kinds of relationships may exist, e.g., composed of)
- Static thesaurus tables
  - [anneal, strain], [antenna, receiver], …
  - Roget's thesaurus
  - WordNet at Princeton

Thesaurus Rules can also be Learned
- From a search engine query log
  - After typing queries, users browse…
  - If query1 and query2 lead to the same document
    - Then, Similar(query1, query2)
  - If query1 leads to a document with title keyword K
    - Then, Similar(query1, K)
  - Then, transitivity…
- Microsoft Research China's work in WWW10 (Wen, et al.) on Encarta online

The Vector-Space Model
- T distinct terms are available; call them index terms or the vocabulary
- The index terms represent important terms for an application
- A vector represents the document: <T1, T2, T3, T4, T5> or <W(T1), W(T2), W(T3), W(T4), W(T5)>
- Example: for a computer science collection, the index terms (vocabulary) of the collection are T1 = architecture, T2 = bus, T3 = computer, T4 = database, T5 = xml

The Vector-Space Model
- Assumption: words are uncorrelated
- Given:
  1. N documents and a query
  2. The query is considered a document too
  3. Each is represented by t terms
  4. Each term j in document i has a weight dij
  5. We will deal with how to compute the weights later

        T1    T2    …    Tt
  D1    d11   d12   …    d1t
  D2    d21   d22   …    d2t
  :     :     :          :
  Dn    dn1   dn2   …    dnt

Graphic Representation
- Example:
  - D1 = 2T1 + 3T2 + 5T3
  - D2 = 3T1 + 7T2 + T3
  - Q = 0T1 + 0T2 + 2T3
- (Figure: D1, D2, and Q drawn as vectors in the space with axes T1, T2, T3.)
- Is D1 or D2 more similar to Q?
- How to measure the degree of similarity? Distance? Angle? Projection?

Similarity Measure - Inner Product
- Similarity between document Di and query Q can be computed as the inner vector product: $sim(D_i, Q) = D_i \cdot Q = \sum_{k=1}^{t} d_{ik} q_k$, where $q_k$ is the weight of term k in Q
- Binary: weight = 1 if the word is present, 0 otherwise
- Non-binary: weight represents degree of similarity
  - Example: TF/IDF, explained later

Inner Product -- Examples
- Binary:
  - Vocabulary: retrieval, database, architecture, computer, text, management, information
  - D = 1, 1, 0, 1, 0
  - Q = 1, 0, 1, 0, 1, 1
  - Size of vector = size of vocabulary = 7
  - sim(D, Q) = 3
- Weighted:
  - D1 = 2T1 + 3T2 + 5T3
  - Q = 0T1 + 0T2 + 2T3
  - sim(D1, Q) = 2*0 + 3*0 + 5*2 = 10
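The weighted example can be checked with a short Python sketch. The function name `inner_product` is an illustrative choice; vectors are plain lists indexed by term (T1, T2, T3).

```python
def inner_product(doc, query):
    """Sum of pairwise products of term weights."""
    return sum(d * q for d, q in zip(doc, query))


# Vectors from the Graphic Representation slide.
D1 = [2, 3, 5]
D2 = [3, 7, 1]
Q = [0, 0, 2]

sim_d1 = inner_product(D1, Q)  # 2*0 + 3*0 + 5*2
sim_d2 = inner_product(D2, Q)  # 3*0 + 7*0 + 1*2
```

Here `sim_d1` is 10 and `sim_d2` is 2, so D1 scores 5 times higher than D2 under the inner product.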

Properties of Inner Product
- The inner product similarity is unbounded
- Favors long documents
  - a long document has a large number of unique terms, each of which may occur many times
- Measures how many terms matched but not how many terms did not match

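Cosine Similarity Measures
- Cosine similarity measures the cosine of the angle between two vectors
- Inner product normalized by the vector lengths:
  $CosSim(D_i, Q) = \frac{\sum_{k=1}^{t} d_{ik} q_k}{\sqrt{\sum_{k=1}^{t} d_{ik}^2} \, \sqrt{\sum_{k=1}^{t} q_k^2}}$
- (Figure: D1, D2, and Q as vectors in the space with axes t1, t2, t3, with the angle between Q and each document marked.)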

Cosine Similarity: an Example
- D1 = 2T1 + 3T2 + 5T3; CosSim(D1, Q) = $5 / \sqrt{38}$ = 0.81
- D2 = 3T1 + 7T2 + T3; CosSim(D2, Q) = $1 / \sqrt{59}$ = 0.13
- Q = 0T1 + 0T2 + 2T3
- D1 is 6 times better than D2 using cosine similarity but only 5 times better using inner product
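A minimal Python sketch of the cosine measure reproduces the two scores above. Returning 0.0 for a zero-length vector is an assumption made here to avoid division by zero; the slides do not discuss that case.

```python
import math


def cosine_similarity(doc, query):
    """Inner product divided by the product of the two vector lengths."""
    dot = sum(d * q for d, q in zip(doc, query))
    norm = math.sqrt(sum(d * d for d in doc)) * math.sqrt(sum(q * q for q in query))
    return dot / norm if norm else 0.0


D1 = [2, 3, 5]
D2 = [3, 7, 1]
Q = [0, 0, 2]

cos_d1 = cosine_similarity(D1, Q)  # 10 / (sqrt(38) * 2)
cos_d2 = cosine_similarity(D2, Q)  # 2 / (sqrt(59) * 2)
```

Rounded to two places, `cos_d1` is 0.81 and `cos_d2` is 0.13, matching the slide.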

Document and Term Weights
- Document term weights are calculated using frequencies in documents (tf) and in the collection (idf)
  - $tf_{ij}$ = frequency of term j in document i
  - $df_j$ = document frequency of term j = number of documents containing term j
  - $idf_j$ = inverse document frequency of term j = $\log_2(N / df_j)$ (N: number of documents in the collection)
- Inverse document frequency: an indication of a term's value as a document discriminator

Term Weight Calculations
- Weight of the jth term in the ith document: $d_{ij} = tf_{ij} \cdot idf_j = tf_{ij} \cdot \log_2(N / df_j)$
- TF: term frequency
  - A term that occurs frequently in the document but rarely in the rest of the collection has a high weight
  - Let $\max_l\{tf_{il}\}$ be the term frequency of the most frequent term in document i
  - Normalization: term frequency = $tf_{ij} / \max_l\{tf_{il}\}$

An example of TF
- Document = (A Computer Science Student Uses Computers)
- Vector model based on keywords (Computer, Engineering, Student)
  - Tf(Computer) = 2, Tf(Engineering) = 0, Tf(Student) = 1
  - Max(Tf) = 2
- TF weight for:
  - Computer = 2/2 = 1
  - Engineering = 0/2 = 0
  - Student = 1/2 = 0.5
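The normalized TF values above can be reproduced with a small Python sketch. It assumes the document has already been lowercased, stop-word filtered, and stemmed (so "Computers" and "Computer" both appear as "computer"); the function name `normalized_tf` is hypothetical.

```python
from collections import Counter


def normalized_tf(doc_terms, keywords):
    """Term frequency of each keyword divided by the largest keyword frequency."""
    counts = Counter(doc_terms)
    max_tf = max((counts[k] for k in keywords), default=0)
    if max_tf == 0:
        # No keyword occurs in the document: all weights are zero.
        return {k: 0.0 for k in keywords}
    return {k: counts[k] / max_tf for k in keywords}


# "A Computer Science Student Uses Computers" after preprocessing (assumed).
doc = ["computer", "science", "student", "use", "computer"]
weights = normalized_tf(doc, ["computer", "engineering", "student"])
```

This gives a weight of 1.0 for "computer", 0.0 for "engineering", and 0.5 for "student".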

Inverse Document Frequency
- $df_j$ gives the number of documents, among the N documents, in which term j appears
- IDF = 1/DF
- Typically use $\log_2(N / df_j)$ for IDF
- Example: given 1000 documents, if "computer" appears in 200 of them, IDF = $\log_2(1000/200) = \log_2 5 \approx 2.32$

TF IDF
- $d_{ij} = (tf_{ij} / \max_l\{tf_{il}\}) \cdot idf_j = (tf_{ij} / \max_l\{tf_{il}\}) \cdot \log_2(N / df_j)$
- Can use this to obtain non-binary weights
- Used to tremendous success in the SMART Information Retrieval System by the late Gerard Salton and M. J. McGill, Cornell University, 1983
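As a quick numerical check, the TF-IDF weight can be written directly in Python. The function names are illustrative; the inputs are the quantities defined on the preceding slides.

```python
import math


def idf(n_docs, df):
    """Inverse document frequency: log2(N / df_j)."""
    return math.log2(n_docs / df)


def tf_idf(tf, max_tf, n_docs, df):
    """Normalized term frequency times inverse document frequency."""
    return (tf / max_tf) * idf(n_docs, df)


# "computer" appears in 200 of 1000 documents (from the IDF slide) and
# twice in a document whose most frequent term also occurs twice (TF slide).
idf_computer = idf(1000, 200)
weight_computer = tf_idf(2, 2, 1000, 200)
```

Both values come out to log2(5), roughly 2.32; a term occurring once in the same document would get half that weight.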

Implementation based on Inverted Files
- In practice, document vectors are not stored directly; an inverted organization provides much better access speed.
- The index file can be implemented as a hash file, a sorted list, or a B-tree.

  Index term   df   (Dj, tfj)
  computer     3    (D7, 4)
  database     2    (D1, 3)
  science      4    (D2, 4)
  system       1    (D5, 2)
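A minimal in-memory version of such an index can be sketched with a Python dictionary, which plays the role of the hash file. The document IDs and contents below are made up for illustration and are not the ones in the table.

```python
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """Map each term to a posting dict {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, terms in docs.items():
        for term, tf in Counter(terms).items():
            index[term][doc_id] = tf
    return index


def document_frequency(index, term):
    """df of a term = number of documents in its posting list."""
    return len(index.get(term, {}))


# Hypothetical, already preprocessed documents.
docs = {
    "D1": ["computer", "database", "computer"],
    "D2": ["database", "system"],
    "D3": ["computer"],
}
index = build_inverted_index(docs)
```

Looking up `index["computer"]` returns `{"D1": 2, "D3": 1}`, so only documents that contain a query term need to be scored.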

A Simple Search Engine
- Now we have enough tools to build a simple search engine (documents == web pages)
  1. Starting from well-known web sites, crawl to obtain N web pages (for very large N)
  2. Apply stop-word removal, stemming, and a thesaurus to select K keywords
  3. Build an inverted index for the K keywords
  4. For any incoming user query Q,
     1. For each document D, compute the cosine similarity score between Q and document D
     2. Select all documents whose score is over a certain threshold T
     3. Let this result set of documents be M
     4. Return M to the user
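Step 4 can be sketched in Python as follows. This is a simplified illustration: it uses raw term counts as weights (TF-IDF weights could be substituted), scans every document instead of using the inverted index, and the documents and threshold are made-up values.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return dot / norm if norm else 0.0


def search(query_terms, docs, threshold):
    """Return IDs of documents scoring above the threshold, best first."""
    q = Counter(query_terms)
    scored = [(cosine(Counter(terms), q), doc_id) for doc_id, terms in docs.items()]
    return [doc_id for score, doc_id in sorted(scored, reverse=True) if score > threshold]


# Hypothetical, already preprocessed documents.
docs = {
    "D1": ["computer", "science"],
    "D2": ["database", "system"],
    "D3": ["computer", "database", "computer"],
}
result = search(["computer"], docs, threshold=0.5)
```

For the query "computer", D3 scores about 0.89 and D1 about 0.71, while D2 scores 0 and is filtered out by the threshold.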

Remaining Questions
- How to crawl?
- How to evaluate the result?
  - Given 3 search engines, which one is better?
  - Is there a quantitative measure?

Measurement
- Let M documents be returned out of a total of N documents
- N = N1 + N2
  - N1 documents in total are relevant to the query
  - N2 are not
- M = M1 + M2
  - M1 of the returned documents are relevant to the query
  - M2 are not
- Precision = M1/M
- Recall = M1/N1
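These two ratios translate directly into code. In the sketch below, document IDs are arbitrary integers chosen for illustration, and returning `None` when a ratio is undefined (nothing retrieved, or no relevant documents) is a design choice made here, not something the slides prescribe.

```python
def precision_recall(retrieved, relevant):
    """Precision = M1/M and recall = M1/N1 for sets of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # M1: retrieved and relevant
    precision = hits / len(retrieved) if retrieved else None
    recall = hits / len(relevant) if relevant else None
    return precision, recall


# Hypothetical query result: 4 documents returned, 3 relevant overall.
precision, recall = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5])
```

Two of the four returned documents are relevant, so precision is 0.5, and two of the three relevant documents were found, so recall is 2/3.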

Retrieval Effectiveness - Precision and Recall
(Figure: the entire document collection, divided into relevant and irrelevant documents and into retrieved and not retrieved documents, giving four regions: retrieved & relevant, retrieved & irrelevant, not retrieved but relevant, not retrieved & irrelevant.)

Precision and Recall
- Precision
  - evaluates the correlation of the query to the database
  - an indirect measure of the completeness of the indexing algorithm
- Recall
  - the ability of the search to find all of the relevant items in the database
- Among three numbers, only two are always available
  - total number of items retrieved
  - number of relevant items retrieved
  - total number of relevant items is usually not available

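Relationship between Recall and Precision
(Figure: precision on the vertical axis and recall on the horizontal axis, both from 0 to 1.)
- The ideal: precision and recall both equal to 1
- High precision, low recall: returns relevant documents but misses many useful ones too
- High recall, low precision: returns most of the relevant documents but includes much junk too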

Fallout Rate
- Problems with precision and recall:
  - A query on "Hong Kong" will return most relevant documents, but that doesn't tell you how good or how bad the system is!
  - the number of irrelevant documents in the collection is not taken into account
  - recall is undefined when there is no relevant document in the collection
  - precision is undefined when no document is retrieved
- Fallout = (number of nonrelevant items retrieved) / (total number of nonrelevant items in the collection)
- Fallout can be viewed as the inverse of recall. A good system should have high recall and low fallout

Total Number of Relevant Items
- In an uncontrolled environment (e.g., the web), it is unknown.
- Two possible approaches to get estimates
  - Sampling across the database and performing relevance judgment on the returned items
  - Apply different retrieval algorithms to the same database for the same query. The aggregate of relevant items is taken as the total relevant set

Computation of Recall and Precision
- Suppose: total no. of relevant docs = 5
- R = 1/5 = 0.2; P = 1/1 = 1
- R = 2/5 = 0.4; P = 2/2 = 1
- R = 2/5 = 0.4; P = 2/3 = 0.67
- …
- R = 5/5 = 1; P = 5/13 = 0.38
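The rank-by-rank computation can be sketched in Python. The relevance pattern below is partly an assumption: the values above only fix that ranks 1 and 2 are relevant, rank 3 is not, and the fifth relevant document appears at rank 13; placing the other two relevant documents at ranks 4 and 6 is a guess made for illustration.

```python
def recall_precision_by_rank(relevance_flags, total_relevant):
    """Return (rank, recall, precision) after each position of a ranked list."""
    points = []
    hits = 0
    for rank, is_relevant in enumerate(relevance_flags, start=1):
        if is_relevant:
            hits += 1
        points.append((rank, hits / total_relevant, hits / rank))
    return points


# Relevant documents assumed at ranks 1, 2, 4, 6 and 13 (see note above).
relevant_ranks = {1, 2, 4, 6, 13}
flags = [rank in relevant_ranks for rank in range(1, 14)]
points = recall_precision_by_rank(flags, total_relevant=5)
```

The first three entries of `points` give recall/precision pairs of 0.2/1.0, 0.4/1.0, and 0.4/0.67, and the last gives 1.0/0.38, in line with the values listed above.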

Computation of Recall and Precision
(Figure: precision on the vertical axis plotted against recall on the horizontal axis, both from 0.2 to 1.0, with points labeled by rank position.)

Compare Two or More Systems
- Compute recall and precision values for two or more systems
- Superimpose the results in the same graph
- The curve closest to the upper right-hand corner of the graph indicates the best performance

The TREC Benchmark
- TREC: Text Retrieval Conference
- Originated from the TIPSTER program sponsored by the Defense Advanced Research Projects Agency (DARPA)
- Became an annual conference in 1992, co-sponsored by the National Institute of Standards and Technology (NIST) and DARPA
- Participants are given parts of a standard set of documents and queries in different stages for testing and training
- Participants submit the P/R values on the final document and query set and present their results at the conference
- http://trec.nist.gov/

Interactive Search Engines
- Aim to improve their search results incrementally; often applied to queries such as "Find all sites with certain property"
- Content-based multimedia search: given a photo, find all other photos similar to it
  - Large vector space
  - Question: which feature (keyword) is important?
- Procedure:
  - User submits query
  - Engine returns result
  - User marks some returned results as relevant or irrelevant, and continues search
  - Engine returns new results
  - Iterates until the user is satisfied

Query Reformulation
- Based on user's feedback on returned results
  - Documents that are relevant: DR
  - Documents that are irrelevant: DN
- Build a new query vector Q' from Q
  - <w1, w2, … wt> --> <w1', w2', … wt'>
- Best known algorithm: Rocchio's algorithm
- Also extensively used in multimedia search

Query Modification
- Use the previously identified relevant and nonrelevant document sets DR and DN to repeatedly modify the query to reach optimality
- Starting with an initial query Q, the modified query has the form
  $Q' = \alpha Q + \beta \sum_{D_i \in D_R} D_i - \gamma \sum_{D_j \in D_N} D_j$
  where Q is the original query, and $\alpha$, $\beta$, and $\gamma$ are suitable constants

An Example

         T1  T2  T3  T4  T5
  Q  = (  5,  0,  3,  0,  1)
  D1 = (  2,  1,  2,  0,  0)
  D2 = (  1,  0,  0,  0,  2)

- Q: original query; D1: relevant doc.; D2: non-relevant doc.
- α = 1, β = 1/2, γ = 1/4
- Assume: dot-product similarity measure
- Sim(Q, D1) = (5×2)+(0×1)+(3×2)+(0×0)+(1×0) = 16
- Sim(Q, D2) = (5×1)+(0×0)+(3×0)+(0×0)+(1×2) = 7

Example (Cont.)
- Q' = Q + (1/2)D1 - (1/4)D2 = (5.75, 0.5, 4, 0, 0.5)
- New similarity scores:
  - Sim(Q', D1) = (5.75×2)+(0.5×1)+(4×2)+(0×0)+(0.5×0) = 20
  - Sim(Q', D2) = (5.75×1)+(0.5×0)+(4×0)+(0×0)+(0.5×2) = 6.75
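The example can be replayed with a short Python sketch of the query update. It assumes the feedback terms are plain sums over the relevant and nonrelevant sets; Rocchio's method is often stated with averages over each set instead, and with one document per set, as here, the two variants coincide. Function names are illustrative.

```python
def rocchio(query, relevant, nonrelevant, alpha, beta, gamma):
    """Q' = alpha*Q + beta*sum(relevant docs) - gamma*sum(nonrelevant docs)."""
    new_query = []
    for k, q_k in enumerate(query):
        pos = sum(doc[k] for doc in relevant)
        neg = sum(doc[k] for doc in nonrelevant)
        new_query.append(alpha * q_k + beta * pos - gamma * neg)
    return new_query


def dot(a, b):
    """Dot-product similarity, as assumed in the example."""
    return sum(x * y for x, y in zip(a, b))


Q = [5, 0, 3, 0, 1]
D1 = [2, 1, 2, 0, 0]  # relevant
D2 = [1, 0, 0, 0, 2]  # non-relevant

Q_new = rocchio(Q, [D1], [D2], alpha=1, beta=0.5, gamma=0.25)
```

`Q_new` is (5.75, 0.5, 4, 0, 0.5); its dot product with D1 rises from 16 to 20 while the one with D2 drops from 7 to 6.75, so the relevant document moves further ahead.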

Link Based Search Engines
Qiang Yang, HKUST

Search Engine Topics
- Text-based search engines
  - Document based
    - Ranking: TF-IDF, Vector Space Model
  - No relationship between pages modeled
  - Cannot tell which page is important without a query
- Link-based search engines: Google, Hubs and Authorities techniques
  - Can pick out important pages

The PageRank Algorithm
- Fundamental question to ask
  - What is the importance level of a page P?
- Information retrieval
  - Cosine + TF IDF does not give related hyperlinks
- Link based
  - Important pages (nodes) have many other links pointing to them
  - Important pages also point to other important pages

The Google Crawler Algorithm
- "Efficient Crawling Through URL Ordering", Junghoo Cho, Hector Garcia-Molina, Lawrence Page, Stanford
  - http://www.www8.org
  - http://www-db.stanford.edu/~cho/crawler-paper/
- "Modern Information Retrieval", BY-RN, pages 380-382
- Lawrence Page, Sergey Brin. The Anatomy of a Search Engine. The Seventh International WWW Conference (WWW 98). Brisbane, Australia, April 14-18, 1998.
  - http://www.www7.org

Back Link Metric
- IB(P) = total number of backlinks of P
- IB(P) is impossible to know; thus, use IB'(P), which is the number of backlinks the crawler has seen so far
- (Figure: a web page P with three incoming links, so IB(P) = 3.)

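Page Rank Metric
- $IR(P) = (1 - d) + d \sum_{i=1}^{N} \frac{IR(T_i)}{C_i}$, where T1, T2, …, TN are the pages that link to web page P
- Let 1-d be the probability that the user randomly jumps to page P; "d" is the damping factor
- Let Ci be the number of out-links from each Ti
- (Figure: pages T1, T2, …, TN pointing to web page P, with C = 2 for T1 and d = 0.9.)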

Matrix Formulation
- Consider a random walk on the web (denote IR(P) by r(P))
  - Let Bij = probability of going directly from i to j
  - Let ri be the limiting probability (page rank) of being at page i
- In the limit, $r_j = \sum_i B_{ij} r_i$, i.e., $r = B^T r$
- Thus, the final page rank r is a principal eigenvector of $B^T$

How to compute page rank?
- For a given network of web pages,
  - Initialize page rank for all pages (to one)
  - Set parameter (d = 0.90)
  - Iterate through the network L times

Example: iteration K=1
- IR(P) = 1/3 for all nodes, d = 0.9
- (Figure: a graph on nodes A, B, C.)

  node   IP
  A      1/3
  B      1/3
  C      1/3

Example: k=2
- l is the in-degree of P
- (Figure: a graph on nodes A, B, C.)

  node   IP
  A      0.4
  B      0.1
  C      0.55

- Note: A, B, C's IP values are updated in order of A, then B, then C
- Use the new value of A when calculating B, etc.

Example: k=2 (normalize)
- (Figure: a graph on nodes A, B, C.)

  node   IP
  A      0.38
  B      0.095
  C      0.52
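The numbers in this example can be reproduced with a short Python sketch. Two things are assumptions rather than content of the slides: the update rule IR(P) = (1-d) + d × Σ IR(Ti)/Ci, and the link structure A→C, B→C, C→A, which is not recoverable from the text and was inferred because it yields exactly the values shown. Pages are updated in place, in the order A, B, C, as the note on the k=2 slide describes.

```python
def pagerank_pass(links, rank, d):
    """One in-place sweep over the pages in alphabetical order."""
    out_degree = {page: len(targets) for page, targets in links.items()}
    for page in sorted(rank):
        # Sum IR(T)/C over every page T that links to `page`, using the
        # newest available values (earlier pages are already updated).
        incoming = sum(
            rank[src] / out_degree[src]
            for src, targets in links.items()
            if page in targets
        )
        rank[page] = (1 - d) + d * incoming
    return rank


def normalize(rank):
    """Scale the values so that they sum to 1."""
    total = sum(rank.values())
    return {page: value / total for page, value in rank.items()}


# Inferred graph (assumption): A -> C, B -> C, C -> A.
links = {"A": ["C"], "B": ["C"], "C": ["A"]}
rank = {page: 1 / 3 for page in links}

rank = pagerank_pass(links, rank, d=0.9)
normalized = normalize(rank)
```

After one sweep the values are approximately A = 0.4, B = 0.1, C = 0.55, and after normalization approximately 0.38, 0.095, and 0.52, matching the tables above.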

Crawler Control
- All crawlers maintain several queues of URLs to pursue next
  - Google initially maintains 500 queues
  - Each queue corresponds to a web site being pursued
- Important considerations:
  - Limited buffer space
  - Limited time
  - Avoid overloading target sites
  - Avoid overloading network traffic

Crawler Control
- Thus, it is important to visit important pages first
- Let G be a lower bound threshold on I(P)
- Crawl and Stop
  - Select only pages with I(P) > G to crawl
  - Stop after K pages have been crawled
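The crawl-and-stop policy can be sketched with a priority queue in Python. Everything concrete here is hypothetical: the `links` dictionary stands in for fetching a page and extracting its URLs, and the `importance` dictionary stands in for whatever estimate of I(P) the crawler has (for example, the backlink count seen so far).

```python
import heapq


def crawl_and_stop(seeds, links, importance, G, K):
    """Visit the most important known pages first; skip pages with I(P) <= G
    and stop once K pages have been crawled."""
    # heapq is a min-heap, so scores are negated to pop the largest first.
    frontier = [(-importance.get(url, 0.0), url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    crawled = []
    while frontier and len(crawled) < K:
        neg_score, url = heapq.heappop(frontier)
        if -neg_score <= G:
            continue  # below the threshold: do not crawl this page
        crawled.append(url)
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (-importance.get(nxt, 0.0), nxt))
    return crawled


# Hypothetical four-page web and importance estimates.
links = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
importance = {"a": 0.9, "b": 0.2, "c": 0.6, "d": 0.8}
order = crawl_and_stop(["a"], links, importance, G=0.5, K=3)
```

In this toy graph the crawler visits "a" and then "c"; "b" falls below the threshold and is skipped, so "d" is never discovered even though its own importance is high, which illustrates one cost of a hard threshold.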

Test Result: 179,000 pages
(Figure: percentage of the Stanford Web crawled vs. PST, the percentage of hot pages visited so far.)

Google Algorithm
- (very simplified)
- First, compute the page rank of each page on the WWW
  - Query independent
- Then, in response to a query q, return pages that contain q and have the highest page ranks
- A problem/feature of Google: favors big commercial sites

How powerful is Google?
- A PageRank for 26 million web pages can be computed in a few hours on a medium size workstation
- Currently has indexed a total of 1.3 billion pages

Hubs and Authorities 1998
- Kleinberg, Cornell University
  - http://www.cs.cornell.edu/home/kleinber/
- Main idea: type "java" in a text-based search engine
  - Get 200 or so pages
  - Which ones are authoritative?
    - http://java.sun.com
  - What about others?
    - www.yahoo.com/Computer/ProgramLanguages

Hubs and Authorities
- An authority is a page pointed to by many strong hubs;
- A hub is a page that points to many strong authorities
- (Figure: hubs pointing to authorities; the remaining pages are labeled "Others".)

H&A Search Engine Algorithm
- First, submit query Q to a text search engine
- Second, among the results returned
  - select ~200, find their neighbors,
  - compute Hubs and Authorities
- Third, return Authorities found as final result
- Important issue: how to find Hubs and Authorities?

Link Analysis: weights
- Let Bij = 1 if i links to j, 0 otherwise
  - hi = hub weight of page i
  - ai = authority weight of page i
- Weight normalization:
  $\sum_i a_i^2 = 1, \quad \sum_i h_i^2 = 1$   (3)
- But, for simplicity, we will use
  $\sum_i a_i = 1, \quad \sum_i h_i = 1$   (3')

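Link Analysis: update a-weight
- $a_i = \sum_j B_{ji} h_j$   (1)
- (Figure: two pages with hub weights h1 and h2 pointing to a page with authority weight a, so a = h1 + h2.)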

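Link Analysis: update h-weight
- $h_i = \sum_j B_{ij} a_j$   (2)
- (Figure: a page with hub weight h pointing to two pages with authority weights a1 and a2, so h = a1 + a2.)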

H&A: algorithm
1. Set a value for K, the number of iterations
2. Initialize all a and h weights to 1
3. For l = 1 to K, do
   a. Apply equation (1) to obtain new ai weights
   b. Apply equation (2) to obtain all new hi weights, using the new ai weights obtained in the last step
   c. Normalize ai and hi weights using equation (3)

Does it converge?
- Yes, the Kleinberg paper includes a proof
- Needs knowledge of linear algebra and eigenvector analysis
- We will skip the proof and only use the results:
  - The a and h weight values will converge after a sufficiently large number of iterations, K.

Example: K=1
- h = 1 and a = 1 for all nodes
- (Figure: a graph on nodes A, B, C.)

  node   a   h
  A      1   1
  B      1   1
  C      1   1

Example: k=1 (update a)

  node   a   h
  A      1   1
  B      0   1
  C      2   1

Example: k=1 (update h)

  node   a   h
  A      1   2
  B      0   2
  C      2   1

Example: k=1 (normalize)
- Use Equation (3')

  node   a     h
  A      1/3   2/5
  B      0     2/5
  C      2/3   1/5

Example: k=2 (update a, h, normalize)
- Use Equation (1)

  node   a     h
  A      1/5   4/9
  B      0     4/9
  C      4/5   1/9

- If we choose a threshold of 1/2, then C is an Authority, and there are no hubs.
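The two iterations can be replayed with a short Python sketch of the H&A loop. The link structure A→C, B→C, C→A is an assumption: the figure is not recoverable from the text, and these edges were inferred because they reproduce every table in the example. Normalization divides by the sum of the weights, as in equation (3').

```python
def hubs_and_authorities(links, iterations):
    """Iterate the a-update, h-update and sum normalization K times."""
    pages = sorted(links)
    a = {p: 1.0 for p in pages}
    h = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # (1) authority weight = sum of hub weights of pages linking in
        a = {p: sum(h[q] for q in pages if p in links[q]) for p in pages}
        # (2) hub weight = sum of the new authority weights of pages linked to
        h = {p: sum(a[q] for q in links[p]) for p in pages}
        # (3') normalize each set of weights to sum to 1
        a_total = sum(a.values()) or 1.0
        h_total = sum(h.values()) or 1.0
        a = {p: v / a_total for p, v in a.items()}
        h = {p: v / h_total for p, v in h.items()}
    return a, h


# Inferred graph (assumption): A -> C, B -> C, C -> A.
links = {"A": ["C"], "B": ["C"], "C": ["A"]}
a1, h1 = hubs_and_authorities(links, iterations=1)
a2, h2 = hubs_and_authorities(links, iterations=2)
```

After one iteration the weights are a = (1/3, 0, 2/3) and h = (2/5, 2/5, 1/5); after two they are a = (1/5, 0, 4/5) and h = (4/9, 4/9, 1/9), so with a threshold of 1/2 only C qualifies as an authority and no page qualifies as a hub.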

Search Engine Using H&A
- For each query q,
  - Enter q into a text-based search engine
  - Find the top 200 pages
  - Find the neighbors of the 200 pages by one link; let the set be S
  - Find hubs and authorities in S
  - Return authorities as final result

Conclusions
- Link-based analysis is very powerful in finding out the important pages
- Models the web as a graph, and is based on in-degree and out-degree
- Google: crawl only important pages
- H&A: post analysis of search result