2. Fig. 1 shows the PageRank Algorithm


2. Fig. 1 shows the PageRank algorithm with random teleports and the web link structure:
  a. Construct the column-stochastic matrices M and A.
  b. Calculate the PageRank with random teleports (β = 0.8) for three iterations.
  c. Which of the three nodes {Yahoo, Amazon, M'soft} is the most important node?
Fig. 1 (a) The PageRank algorithm (b) The web link structure


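Ans:
a. With rows and columns ordered (y, a, m) = (Yahoo, Amazon, M'soft):

\[
M = \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 1 \end{pmatrix}
\]

\[
A = 0.8 \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 1 \end{pmatrix}
  + 0.2 \begin{pmatrix} 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \end{pmatrix}
  = \begin{pmatrix} 7/15 & 7/15 & 1/15 \\ 7/15 & 1/15 & 1/15 \\ 1/15 & 7/15 & 13/15 \end{pmatrix}
\]

b. Iterating \( r_{k+1} = A r_k \) with \( r = (y, a, m)^T \):

        r0      r1      r2      r3
  y     1       1.00    0.84    0.776
  a     1       0.60    0.60    0.536
  m     1       1.40    1.56    1.688

c. The most important node is M'soft (m), which has the largest rank (1.688) after three iterations.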
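The three iterations can be checked numerically. The following is a minimal sketch, assuming the node order (y, a, m), a teleport parameter of 0.8, and a start vector of all ones as in the answer; the names `BETA`, `step`, and `pagerank` are illustrative and do not come from the slides.

```python
BETA = 0.8  # teleport parameter from the question (the name BETA is an assumption)

# Column-stochastic link matrix M; rows and columns ordered (y, a, m).
M = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 0.0],
    [0.0, 0.5, 1.0],
]
N = len(M)

# A = BETA * M + (1 - BETA) * (1/N) added to every entry.
A = [[BETA * M[i][j] + (1 - BETA) / N for j in range(N)] for i in range(N)]


def step(matrix, r):
    """One power-iteration step: returns matrix * r."""
    return [sum(matrix[i][j] * r[j] for j in range(len(r)))
            for i in range(len(matrix))]


def pagerank(matrix, r0, iterations):
    """Return the list [r0, r1, ..., r_iterations]."""
    history = [r0]
    for _ in range(iterations):
        history.append(step(matrix, history[-1]))
    return history


ranks = pagerank(A, [1.0, 1.0, 1.0], 3)
for k, r in enumerate(ranks):
    print(f"r{k}:", [round(x, 3) for x in r])
```

Because every column of A sums to 1, the entries of each r_k keep summing to 3, which is a quick sanity check on the table.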


3. As shown in Fig. 2, document frequency thresholding is an important step toward feature extraction in text mining.
  a) Given the content of documents D1 and D2 shown in Fig. 3, fill in the document-feature matrix.
  b) Given N = 10 and a threshold of 1.5, which feature terms are extracted from D1 using inverse document frequency weighting?
  c) Given N = 10 and a threshold of 1.0, which feature terms are extracted from D2 using entropy weighting?
P.S. Sub-question c) is computationally intensive; given the limited exam time, this type of question is very unlikely to appear on the final exam.
Fig. 2 Flowchart for feature extraction in text mining.
Fig. 3 Content of documents D1 and D2.


Feature Extraction: Weighting Model (3)
  • Entropy weighting, defined in terms of the average entropy of the j-th term, where
      gf_j = number of times the j-th term occurs in the whole training document collection
  • The average entropy of a term is
      -1 if the word occurs exactly once in every document
       0 if the word occurs in only one document
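The formula itself did not survive on this slide, so the following is only a sketch under an assumption: it uses the common log-entropy form, average entropy = (1 / log N) Σ_i (tf_ij / gf_j) log(tf_ij / gf_j), which reproduces the two boundary values stated above (-1 and 0). The function names are hypothetical, and the weight is written as log2(tf + 1) × (1 + average entropy), which matches the "(1 - entropy)" form if that entropy is taken as the negated average entropy.

```python
import math


def average_entropy(term_counts):
    """Average entropy of one term over all documents (assumed log-entropy form).

    term_counts[i] is the number of times the term occurs in document i.
    The result lies between -1 (spread evenly over all documents) and 0
    (concentrated in a single document).
    """
    n_docs = len(term_counts)
    gf = sum(term_counts)  # global frequency of the term in the collection
    total = 0.0
    for tf in term_counts:
        if tf > 0:  # the limit of p * log(p) is 0 as p goes to 0
            p = tf / gf
            total += p * math.log(p)
    return total / math.log(n_docs)


def entropy_weight(tf, avg_entropy):
    """Assumed weight: log2(tf + 1) * (1 + average entropy)."""
    return math.log2(tf + 1) * (1 + avg_entropy)


once_everywhere = average_entropy([1] * 10)        # close to -1
single_document = average_entropy([3] + [0] * 9)   # equal to 0
print(round(once_everywhere, 6), round(single_document, 6))
```

Under this form, a term spread evenly over every document receives weight 0 and a term confined to one document keeps its full log-frequency weight.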


Ans:
a) Document-feature matrix:

        A   B   K   O   Q   R   S   T   W   X
  D1    4   1   0   1   1   1   1   1   1   1
  D2    4   2   1   0   1   …

b) w_ij = Freq_ij × log2(N / DocFreq_j)

  Term                 A      B      K      O      Q      R      S      T      W      X
  DocFreq_j            10     5      3      3      5      2      1      5      3      5
  N/DocFreq_j          1.00   2.00   3.33   3.33   2.00   5.00   10.00  2.00   3.33   2.00
  log2(N/DocFreq_j)    0.00   1.00   1.74   1.74   1.00   2.32   3.32   1.00   1.74   1.00
  Freq_ij for D1       4      1      0      1      1      1      1      1      1      1
  tf×idf               0.00   1.00   0.00   1.74   1.00   2.32   3.32   1.00   1.74   1.00

  => feature terms (tf×idf > 1.5): O, R, S, W

c) w_ij = log2(Freq_ij + 1) × (1 - entropy(w_j))

  Terms:            A  B  K  O  Q  R  S  T  W  X
  Freq_ij for D2:   4  2  1  0  1  …
  Entropy(w_j):     0.4  0.1  0.3  0.4  0.3  0.1  …
  w_ij:             1.39  1.43  0.90  0.00  0.70  0.60  0.00  0.90  …

  => feature terms (w_ij > 1.0): A, B
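The tf×idf selection in part b) can be reproduced with a short script. This is a sketch under stated assumptions: the document frequencies and the D1 term frequencies are read off the answer table (the D1 row is reconstructed from the tf×idf row), the logarithm is base 2, and a term is kept when its weight exceeds the threshold 1.5. The variable names are illustrative.

```python
import math

N_DOCS = 10       # N from the question
THRESHOLD = 1.5   # threshold from part b)

terms = ["A", "B", "K", "O", "Q", "R", "S", "T", "W", "X"]
doc_freq = [10, 5, 3, 3, 5, 2, 1, 5, 3, 5]   # DocFreq_j, from the answer table
freq_d1 = [4, 1, 0, 1, 1, 1, 1, 1, 1, 1]     # Freq_ij for D1 (reconstructed)

# w_ij = Freq_ij * log2(N / DocFreq_j)
weights = {
    term: freq * math.log2(N_DOCS / df)
    for term, freq, df in zip(terms, freq_d1, doc_freq)
}

# Keep the terms whose weight exceeds the threshold.
features = [term for term in terms if weights[term] > THRESHOLD]

for term in terms:
    print(term, round(weights[term], 2))
print("feature terms:", features)
```

Term A illustrates why idf matters here: it is the most frequent term in D1, yet it occurs in all 10 documents, so its weight is 0 and it is dropped.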


4. Data preprocessing is essential for web usage mining.
  a) Explain the four steps of data preprocessing.
  b) Given the web page linkage shown in Fig. 4(c), refine the user sessions shown in Fig. 4(a).
  c) Given the web page linkage shown in Fig. 4(c), complete the paths in Fig. 4(b).
Fig. 4


Ans:
a)
b) Three sessions:
  - A-B-F-O-G-A-D
  - L-R
  - A-B-C-J
c) Four sessions:
  - A-B-F-O-F-B-G
  - A-D
  - L-R
  - A-B-A-C-J
or four sessions:
  - A-B-F-O-G
  - A-D
  - L-R
  - A-B-C-J
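The path completion in part c) can be sketched in code. Fig. 4(c) is not reproduced here, so the link graph `LINKS` below is hypothetical: it is only inferred from the completed paths in the answer (for example, O does not link to G, so the user must have gone back through F to B). The function `complete_path` is an illustrative name, and the backtracking rule it uses (return to the most recent earlier page that links to the requested page) is one common heuristic, not necessarily the exact procedure from the lecture.

```python
# Hypothetical link graph inferred from the answers; not taken from Fig. 4(c).
LINKS = {
    "A": {"B", "C", "D"},
    "B": {"F", "G"},
    "C": {"J"},
    "F": {"O"},
    "L": {"R"},
}


def complete_path(session, links):
    """Insert the missing back-button page views into a logged session."""
    history = [session[0]]  # pages on the current forward path
    path = [session[0]]     # completed path, including backtracking
    for page in session[1:]:
        # Find the most recent page on the forward path that links to `page`.
        depth = len(history) - 1
        while depth >= 0 and page not in links.get(history[depth], set()):
            depth -= 1
        if depth < 0:
            raise ValueError(f"no earlier page links to {page!r}")
        # Record each page revisited while going back to that page.
        for back in range(len(history) - 2, depth - 1, -1):
            path.append(history[back])
        del history[depth + 1:]
        history.append(page)
        path.append(page)
    return path


print("-".join(complete_path(list("ABFOG"), LINKS)))
print("-".join(complete_path(list("ABCJ"), LINKS)))
```

With this graph the two logged sessions A-B-F-O-G and A-B-C-J are completed to the paths given in the first answer for part c).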