Clustering: Supervised Segmentation and Unsupervised Segmentation

Clustering • Supervised segmentation • Unsupervised segmentation • The idea of finding natural groupings in the data may be called unsupervised segmentation, or more simply clustering.

Clustering • Supervised segmentation • Unsupervised segmentation • Clustering is another application of our fundamental notion of similarity. The basic idea is that we want to find groups of objects (consumers, businesses, whiskeys, etc.), where the objects within groups are similar, but the objects in different groups are not so similar.

Example: Whiskey Analytics Revisited • We want to take our example a step further and find clusters of similar whiskeys. • One reason we might want to find clusters of whiskeys is simply to understand the problem better.

Example: Whiskey Analytics Revisited • This is an example of exploratory data analysis, to which data-rich businesses should continually devote some energy and resources, as such exploration can lead to useful and profitable discoveries.

Example: Whiskey Analytics Revisited • In our example, if we are interested in Scotch whiskeys, we may simply want to understand the natural groupings by taste, because we want to understand our “business,” which might lead to a better product or service. See page 164 for explanation.

Hierarchical Clustering • Figure 6-6. Six points and their possible clusterings. At left are shown six points, A-F, with circles 1-5 showing different distance-based groupings that could be imposed. These groups form an implicit hierarchy. At the right is a dendrogram corresponding to the groupings, which makes the hierarchy explicit.

Hierarchical Clustering • An advantage of hierarchical clustering is that it allows the data analyst to see the groupings, the “landscape” of data similarity, before deciding on the number of clusters to extract. • Note also that once two clusters are joined at one level, they remain joined in all higher levels of the hierarchy. • The clusters are merged based on the similarity or distance function that is chosen.
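
To make the mechanics concrete, here is a minimal sketch of hierarchical clustering using SciPy's agglomerative routines. The whiskey names and three-dimensional taste vectors are invented for illustration; they are not the book's actual data.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Hypothetical whiskeys with made-up taste features (e.g., sweetness, smokiness, body).
names = ["Bunnahabhain", "Bruichladdich", "Tullibardine", "Laphroaig", "Aultmore"]
tastes = np.array([
    [2, 1, 2],
    [2, 1, 1],
    [3, 1, 2],
    [1, 4, 3],
    [4, 0, 1],
])

# Repeatedly merge the two closest clusters (Euclidean distance, complete linkage);
# Z records the merge hierarchy. Once joined at one level, clusters stay joined above it.
Z = linkage(tastes, method="complete", metric="euclidean")

# The dendrogram is the "landscape" of similarity the analyst inspects
# before deciding how many clusters to extract.
dendrogram(Z, labels=names)
plt.show()
```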

Tree of Life • One of the best known uses of hierarchical clustering is in the “Tree of Life” (Sugden et al., 2003; Pennisi, 2003). • This chart is based on a hierarchical clustering of RNA sequences. • The center is the “last universal ancestor” of all life on earth, from which branch the three domains of life (eukaryota, bacteria, and archaea).

[Figure: the Tree of Life chart, a circular dendrogram of RNA sequences]

A magnified portion of this tree containing the particular bacterium Helicobacter pylori, which causes ulcers.

Whiskey Analytics • Since Foster loves the Bunnahabhain recommended to him by his friend at the restaurant the other night, the clustering suggests a set of other “most similar” whiskeys (Bruichladdich, Tullibardine, etc.). • The most unusual-tasting single malt in the data appears to be Aultmore, at the very top, which is the last whiskey to join any others.

Whiskey Analytics • This excerpt shows that most of its nearest neighbors (Tullibardine, Glenglassaugh, etc.) do indeed cluster near it in the hierarchy. • You may wonder why the clusters do not correspond exactly to the similarity ranking.

Example: Whiskey Analytics • Foster, one of the authors, likes Bunnahabhain. He wants to find similar ones. The following five single-malt Scotches are most similar to Bunnahabhain, ordered by increasing distance:
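
A minimal sketch of how such a similarity ranking could be computed, assuming each whiskey is described by a numeric taste vector; the vectors below are invented for illustration.

```python
import numpy as np

names = np.array(["Bruichladdich", "Tullibardine", "Laphroaig", "Aultmore"])
tastes = np.array([[2, 1, 1], [3, 1, 2], [1, 4, 3], [4, 0, 1]], dtype=float)
bunnahabhain = np.array([2, 1, 2], dtype=float)  # made-up query vector

# Euclidean distance from Bunnahabhain to every other whiskey,
# then sort by increasing distance (most similar first).
dists = np.linalg.norm(tastes - bunnahabhain, axis=1)
for i in np.argsort(dists):
    print(f"{names[i]}: {dists[i]:.2f}")
```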

Whiskey Analytics • The reason is that, while the five whiskeys we found are the most similar to Bunnahabhain, some of these five are more similar to other whiskeys in the dataset, so they are clustered with these closer neighbors before joining Bunnahabhain.

Whiskey Analytics • So instead of simply stocking the most recognizable Scotches, or a few Highland, Lowland, and Islay brands, our specialty shop owner could choose to stock single malts from the different clusters. • Alternatively, one could create a guide to Scotch whiskeys that might help single malt lovers to choose whiskeys.

Nearest Neighbors Revisited: Clustering Around Centroids • The most common method for focusing on the clusters themselves is to represent each cluster by its “cluster center,” or centroid. • Figure 6-10 illustrates the idea in two dimensions: here we have three clusters, whose instances are represented by the circles.

Nearest Neighbors Revisited: Clustering Around Centroids • Each cluster has a centroid, represented by the solid-lined star. • The star is not necessarily one of the instances; it is the geometric center of a group of instances.

Nearest Neighbors Revisited: Clustering Around Centroids • Figure 6-10. The first step of the k-means algorithm: find the points closest to the chosen centers (possibly chosen randomly). This results in the first set of clusters. • Figure 6-11. The second step of the k-means algorithm: find the actual center of the clusters found in the first step.

Nearest Neighbors Revisited: Clustering Around Centroids • The most popular centroid-based clustering algorithm is called k-means clustering (MacQueen, 1967; Lloyd, 1982; MacKay, 2003). • In k-means, the “means” are the centroids, represented by the arithmetic means (averages) of the values along each dimension for the instances in the cluster.

Nearest Neighbors Revisited: Clustering Around Centroids • So in Figure 6-10, to compute the centroid for each cluster, we would average all the x values of the points in the cluster to form the x coordinate of the centroid, and average all the y values to form the centroid’s y coordinate.

Nearest Neighbors Revisited: Clustering Around Centroids • Generally, the centroid is the average of the values for each feature of each example in the cluster. The result is shown in Figure 6-11. • The k in k-means is simply the number of clusters that one would like to find in the data. • Unlike hierarchical clustering, k-means starts with a desired number of clusters, k.
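
A minimal sketch of the centroid computation just described: averaging each feature (each coordinate) across the cluster's members. The points are invented.

```python
import numpy as np

cluster = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 3.0]])  # three points in one cluster

# axis=0 averages each column: all the x values, then all the y values.
centroid = cluster.mean(axis=0)
print(centroid)  # [2. 3.]
```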

Nearest Neighbors Revisited: Clustering Around Centroids • So, in Figure 6-10, the analyst would have specified k=3, and the k-means clustering method would return (i) the three cluster centroids when the clustering method terminates (the three solid-lined stars in Figure 6-11), plus (ii) information on which of the data points belongs to each cluster.
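
In code, a library such as scikit-learn (one possible choice; the book does not prescribe a library) returns exactly these two outputs:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(90, 2)  # 90 illustrative points on the plane

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)  # (i) the three cluster centroids at termination
print(km.labels_)           # (ii) which cluster each data point belongs to
```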

Nearest Neighbors Clustering • This is sometimes referred to as nearest-neighbor clustering because the answer to (ii) is simply that each cluster contains those points that are nearest to its centroid (rather than to one of the other centroids).
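
A minimal sketch of that assignment rule, with invented points and centroids: each point joins the cluster whose centroid is nearest.

```python
import numpy as np

points = np.random.rand(90, 2)                              # illustrative data
centroids = np.array([[0.2, 0.2], [0.5, 0.8], [0.9, 0.3]])  # illustrative centers

# Distance from every point to every centroid, shape (90, 3);
# argmin picks the nearest centroid for each point.
dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
labels = dists.argmin(axis=1)
```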

k-means algorithm • The k-means algorithm for finding the clusters is simple and elegant. • The algorithm starts by creating k initial cluster centers, usually randomly, but sometimes by choosing k of the actual data points, by being given specific initial starting points by the user, or via a pre-processing of the data to determine a good set of starting centers.

k-means algorithm • Think of the stars in Figure 6-10 as being these initial (k=3) cluster centers. • Then the algorithm proceeds as follows. As shown in Figure 6-10, the clusters corresponding to these cluster centers are formed by determining which is the closest center to each point.

k-means algorithm • Next, for each of these clusters, its center is recalculated by finding the actual centroid of the points in the cluster. • As shown in Figure 6-11, the cluster centers typically shift; in the figure, we see that the new solid-lined stars are indeed closer to what intuitively seems to be the center of each cluster. And that’s pretty much it.

k-means algorithm • The process simply iterates: since the cluster centers have shifted, we need to recalculate which points belong to each cluster (as in Figure 6-10). • Once these are reassigned, we might have to shift the cluster centers again.

k-means algorithm • The k-means procedure keeps iterating until there is no change in the clusters (or possibly until some other stopping criterion is met).
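
Putting the steps together, here is a minimal from-scratch sketch of the loop just described (NumPy, invented data); a production implementation would also handle empty clusters and cap the number of iterations.

```python
import numpy as np

def kmeans(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize with k actual data points chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    while True:
        # Step 1 (Figure 6-10): assign each point to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Stop when no point changes cluster.
        if labels is not None and np.array_equal(labels, new_labels):
            return centers, labels
        labels = new_labels
        # Step 2 (Figure 6-11): recompute each center as its cluster's centroid.
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])

centers, labels = kmeans(np.random.rand(90, 2), k=3)
```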

k-means algorithm • Figure 6-12 and Figure 6-13 show an example run of k-means on 90 data points with k=3. • This dataset is a little more realistic in that it does not have such well-defined clusters as in the previous example. • Figure 6-12 shows the initial data points before clustering.

Figure 6-12. A k-means clustering example using 90 points on a plane and k=3 centroids. This figure shows the initial set of points.

k-means algorithm • Figure 6-13 shows the final results of clustering after 16 iterations. • The three (erratic) lines show the path from each centroid’s initial (random) location to its final location. • The points in the three clusters are denoted by different symbols (circles, x’s, and triangles).

Figure 6-13. A k-means clustering example using 90 points on a plane and k=3 centroids. This figure shows the movement paths of the centroids (the three lines) through 16 iterations of the clustering algorithm. The marker shape of each point represents the cluster to which it is finally assigned.

Centroid algorithms • A common concern with centroid algorithms such as k-means is how to determine a good value for k. • One answer is simply to experiment with different k values and see which ones generate good results. • Since k-means is often used for exploratory data mining, the analyst must examine the clustering results anyway to determine whether the clusters make sense.

Centroid algorithms • Usually this can reveal whether the number of clusters is appropriate. • The value for k can be decreased if some clusters are too small and overly specific, and increased if some clusters are too broad and diffuse.
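
A minimal sketch of that experiment: fit k-means for several values of k and compare a numeric quality score. The silhouette score is one common choice (an assumption here; the text itself recommends inspecting the clusters by eye).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.rand(90, 2)  # placeholder data

for k in range(2, 10):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))  # higher is better
```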

Clustering Business News Stories • The objective of this example is to identify, informally, different groupings of news stories released about a particular company.

Clustering Business News Stories • This may be useful for a specific application, for example: to get a quick understanding of the news about a company without having to read every news story, or to understand the data before undertaking a more focused data mining project, such as relating business news stories to stock performance.

Clustering Business News Stories • For this example we chose a large collection of (text) news stories: the Thomson Reuters Text Research Collection (TRC2), a corpus of news stories created by the Reuters news agency and made available to researchers. • The entire corpus comprises 1,800,370 news stories from January 2008 through February 2009 (14 months).

Clustering Business News Stories • To make the example tractable but still realistic, we’re going to extract only those stories that mention a particular company, in this case Apple (whose stock symbol is AAPL).

Data preparation • We extracted stories whose headlines specifically mentioned Apple, thus assuring that the story is very likely news about Apple itself. • There were 312 such stories, but they covered a wide variety of topics.

Data preparation • Then each document was represented by a numeric feature vector, using “TFIDF scores” for each vocabulary word in the document. • TFIDF (Term Frequency times Inverse Document Frequency) scores represent the frequency of the word in the document, penalized by the frequency of the word in the corpus.
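
A minimal sketch of the TFIDF score itself, using the common formulation tfidf(t, d) = tf(t, d) × log(N / df(t)); real systems typically use smoothed variants, so treat this as illustrative rather than the authors' exact scoring.

```python
import math

docs = [  # tokenized stand-ins for the Apple news stories
    "apple shares rise on optimism over iphone demand".split(),
    "apple introduces iphone push e-mail software".split(),
    "rbc raises apple price target".split(),
]

def tfidf(term, doc, docs):
    tf = doc.count(term)                    # frequency of the word in the document
    df = sum(1 for d in docs if term in d)  # how many documents contain the word
    return tf * math.log(len(docs) / df) if df else 0.0

print(tfidf("apple", docs[0], docs))   # 0.0: "apple" appears in every story
print(tfidf("iphone", docs[0], docs))  # > 0: rarer across the corpus, less penalized
```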

The news story clusters • We chose to cluster the stories into nine groups (k=9 for k-means). • Here we present a description of the clusters, along with some headlines of the stories contained in that cluster. • It is important to remember that the entire news story was used in the clustering, not just the headline.
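
An end-to-end sketch of the pipeline, assuming scikit-learn: TFIDF vectors fed to k-means. On the real 312-story corpus the authors used k=9; here k=2 on three sample headlines just so the snippet runs.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

stories = [  # stand-ins for the full story texts
    "apple shares rise on optimism over iphone demand",
    "apple shares pare losses, still down 5 pct",
    "rbc raises apple price target to $200 from $190",
]

X = TfidfVectorizer(stop_words="english").fit_transform(stories)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # the cluster each story lands in, for manual inspection
```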

The news story clusters • Cluster 1. These stories are analysts’ announcements concerning ratings changes and price target adjustments: • rbc raises apple <aapl.o> price target to $200 from $190; keeps outperform rating • thinkpanmure assumes apple <aapl.o> with buy rating; $225 price target • american technology raises apple <aapl.o> to buy from neutral

The news story clusters • Cluster 1 (continued): • caris raises apple <aapl.o> price target to $200 from $170; rating above average • caris cuts apple <aapl.o> price target to $155 from $165; keeps above average rating

Cluster 2. This cluster contains stories about Apple’s stock price movements, during and after each day of trading: • Apple shares pare losses, still down 5 pct • Apple rises 5 pct following strong results • Apple shares rise on optimism over iPhone demand

Cluster 2 (continued): • Apple shares decline ahead of Tuesday event • Apple shares surge, investors like valuation

Cluster 3. In 2008, there were many stories about Steve Jobs, Apple’s charismatic CEO, and his struggle with pancreatic cancer. Jobs’ declining health was a topic of frequent discussion, and many business stories speculated on how well Apple would continue without him. Such stories clustered here:

• ANALYSIS-Apple success linked to more than just Steve Jobs • NEWSMAKER-Jobs used bravado, charisma as public face of Apple • COLUMN-What Apple loses without Steve: Eric Auchard • Apple could face lawsuits over Jobs’ health

Cluster 4. This cluster contains various Apple announcements and releases. Superficially, these stories were similar, though the specific topics varied:

• Apple introduces iPhone “push” e-mail software • Apple CFO sees 2nd-qtr margin of about 32 pct • Apple says confident in 2008 iPhone sales goal • Apple CFO expects flat gross margin in 3rd quarter • Apple to talk iPhone software plans on March 6

Cluster 5. This cluster’s stories were about the iPhone and deals to sell iPhones in other countries. • Cluster 6. One class of stories reports on stock price movements outside of normal trading hours (known as Before and After the Bell). • Cluster 7. This cluster contained little thematic consistency.

Cluster 8. Stories on iTunes and Apple’s position in digital music sales formed this cluster. • Cluster 9. A particular kind of Reuters news story is a News Brief, which is usually just a few itemized lines of very terse text (e.g., “• Says purchase new movies on itunes same day as dvd release”). The contents of these News Briefs varied, but because of their very similar form they clustered together.

The news story clusters • As we can see, some of these clusters are interesting and thematically consistent while others are not. • A cliché in statistics: correlation is not causation, meaning that just because two things co-occur doesn’t mean that one causes the other.

The news story clusters • A similar caveat applies in clustering: syntactic similarity is not semantic similarity. Just because two things, particularly text passages, have common surface characteristics doesn’t mean they are necessarily related semantically. • We shouldn’t expect every cluster to be meaningful and interesting. • Nevertheless, clustering is often a useful tool for uncovering structure in our data that we did not foresee.

Understanding the Results of Clustering • As we mentioned above, the result of clustering is either a dendrogram or a set of cluster centers plus the corresponding data points for each cluster. • How can we understand the clustering? • This is particularly important because clustering often is used in exploratory analysis, so the whole point is to understand whether something was discovered, and if so, what? • How to understand the clustering and the clusters depends on the sort of data being clustered and the domain of application, but there are several methods that apply broadly.

Consider our whiskey example again • Whiskey researchers Lapointe and Legendre cut their dendrogram into 12 clusters; here are two of them: • Group A • Scotches: Aberfeldy, Glenugie, Laphroaig, Scapa • Group H • Scotches: Bruichladdich, Deanston, Fettercairn, Glenfiddich, Glen Mhor, Glen Spey, Glentauchers, Ladyburn, Tobermory

Understanding the Results of Clustering • The more important factor in understanding these clusters, at least for someone who knows a little about single malts, is that the elements of the cluster can be represented by the names of the whiskeys. In this case, the names of the data points are meaningful in and of themselves, and convey meaning to an expert in the field. • What can we do in cases where we cannot simply show the names of our data points, or for which showing the names does not give sufficient understanding?

Consider the whiskey example again. • Group A • Scotches: Aberfeldy, Glenugie, Laphroaig, Scapa • The best of its class: Laphroaig (Islay), 10 years, 86 points • Average characteristics: full gold; fruity, salty; medium; oily, salty, sherry; dry • Group H • Scotches: Bruichladdich, Deanston, Fettercairn, Glenfiddich, Glen Mhor, Glen Spey, Glentauchers, Ladyburn, Tobermory • The best of its class: Bruichladdich (Islay), 10 years, 76 points • Average characteristics: white wine, pale; sweet; smooth, light; sweet, dry, fruity, smoky; dry, light

Understanding the Results of Clustering • Here we see two additional pieces of information useful for understanding the results of clustering. • First, in addition to listing out the members, an “exemplar” member is listed. • Here it is the “best of its class” whiskey, taken from Jackson (1989). • Alternatively, it could be the best-known or highest-selling whiskey in the cluster.
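
When no expert rating like Jackson's is available, one simple automatic substitute (an assumption, not the book's method) is to pick the member closest to the cluster's centroid as the exemplar:

```python
import numpy as np

names = ["Aberfeldy", "Glenugie", "Laphroaig", "Scapa"]
members = np.array([[2.0, 1.0], [3.0, 1.5], [1.0, 4.0], [2.5, 0.5]])  # invented features

centroid = members.mean(axis=0)
exemplar = names[np.linalg.norm(members - centroid, axis=1).argmin()]
print(exemplar)  # the most "central" member stands in for the cluster
```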

Solving a Business Problem Versus Data Exploration • We separate our problems into supervised (e.g., predictive modeling) and unsupervised (e.g., clustering). • There is a direct trade-off in where and how effort is expended in the data mining process. • For the supervised problems, since we spent so much time precisely defining the problem we were going to solve, in the Evaluation stage of the data mining process we already have a clear-cut evaluation question: do the results of the modeling seem to solve the problem we have defined?

The CRISP data mining process

Solving a Business Problem Versus Data Exploration • In contrast, unsupervised problems often are much more exploratory. • We may have a notion that if we could cluster companies, news stories, or whiskeys, we would understand our business better, and therefore be able to improve something. • However, we may not have a precise formulation. • The trade-off is that for problems where we did not achieve a precise formulation early in the data mining process, we have to spend more time later in the process, in the Evaluation stage.

Solving a Business Problem Versus Data Exploration • For clustering specifically, it often is difficult even to understand what (if anything) the clustering reveals. • Even when the clustering does seem to reveal interesting information, it often is not clear how to use that to make better decisions. • Therefore, for clustering, additional creativity and business knowledge must be applied in the Evaluation stage of the data mining process.