k-Means and DBSCAN
Gyozo Gidofalvi, Uppsala Database Laboratory
Announcements
§ Updated material for assignment 2 is on the lab course home page.
§ Sign-up sheets for labs and examinations for assignment 2 are posted outside P1321.
§ Office hours are posted.
2020-10-02 Gyozo Gidofalvi
k-Means
§ Input
  • M (set of points)
  • k (number of clusters)
§ Output
  • µ_1, …, µ_k (cluster centroids)
§ k-Means partitions the points in M into k clusters by minimizing the squared error function
  $$E = \sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - \mu_i \rVert^2$$
  over the clusters S_i, i = 1, …, k, where µ_i is the centroid of all x_j ∈ S_i.
k-Means algorithm
select (m_1 … m_k) randomly from M          % initial centroids
do
    (µ_1 … µ_k) = (m_1 … m_k)
    all clusters C_i = {}
    for each point p in M
        % compute cluster membership of p
        i = argmin_j(dist(µ_j, p))
        % assign p to the corresponding cluster
        C_i = C_i ∪ {p}
    end
    for each cluster C_i                     % recompute the centroids
        m_i = avg(p in C_i)
    end
while exists m_i ≠ µ_i                       % convergence criterion
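The loop above can be sketched in plain Python. This is a minimal illustration, not the course implementation: the function names, the `seed` parameter, and the rule that an empty cluster keeps its previous centroid are assumptions made for this example.

```python
import math
import random

def dist(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(points):
    """Component-wise average of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(M, k, seed=0):
    """Return (centroids, clusters), following the do/while loop of the slide."""
    rng = random.Random(seed)
    m = rng.sample(M, k)                       # initial centroids drawn from M
    while True:
        mu = m
        clusters = [[] for _ in range(k)]      # all clusters C_i = {}
        for p in M:
            # index of the closest centroid (ties go to the lowest index)
            i = min(range(k), key=lambda j: dist(mu[j], p))
            clusters[i].append(p)
        # Recompute the centroids. Assumption: an empty cluster keeps its
        # old centroid so that avg() is never taken over an empty set.
        m = [mean(c) if c else mu[i] for i, c in enumerate(clusters)]
        if m == mu:                            # no centroid moved: converged
            return m, clusters

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, clusters = k_means(points, k=2)
```

Comparing the new and old centroids for exact equality works here because an unchanged assignment reproduces exactly the same averages; a tolerance-based test would be the more common choice for large data sets.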
k-Means on three clusters [figure]
I'm feeling unlucky: bad initial points [figure]
k-Means in practice
§ How to choose initial centroids
  • select randomly among the data points
  • generate completely randomly
§ How to choose k
  • study the data
  • run k-Means for different k
    § measure the squared error for each k
§ Run k-Means many times!
  • Get many choices of initial points
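The two practical tips, restarting from several random initializations and comparing the squared error across values of k, can be sketched as follows. The helper names `lloyd` and `best_of`, the number of restarts, and the toy data are assumptions chosen for illustration only.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def lloyd(M, k, rng, max_iter=100):
    """One k-Means run from random initial points; returns (squared error, centroids)."""
    mu = rng.sample(M, k)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in M:
            clusters[min(range(k), key=lambda j: sq_dist(mu[j], p))].append(p)
        # an empty cluster keeps its old centroid (assumption of this sketch)
        new_mu = [tuple(sum(c) / len(pts) for c in zip(*pts)) if pts else mu[i]
                  for i, pts in enumerate(clusters)]
        if new_mu == mu:
            break
        mu = new_mu
    sse = sum(min(sq_dist(m, p) for m in mu) for p in M)
    return sse, mu

def best_of(M, k, restarts=20, seed=0):
    """Run k-Means several times and keep the run with the lowest squared error."""
    rng = random.Random(seed)
    return min((lloyd(M, k, rng) for _ in range(restarts)), key=lambda r: r[0])

# Toy data: three tight groups of three points each.
M = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
     (10.0, 0.0), (10.0, 1.0), (11.0, 0.0),
     (5.0, 9.0), (5.0, 10.0), (6.0, 9.0)]
errors = {k: best_of(M, k)[0] for k in range(1, 6)}
for k, e in errors.items():
    print(k, round(e, 3))
```

On data like this the squared error drops sharply until k reaches the number of groups and only slowly afterwards, which is the pattern to look for when choosing k.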
k-Means iteration step in AmosQL
§ Calculate point-to-centroid distances: calp2c_distance(…)
    select p, c, d
    from Vector of Number p, Vector of Number c, Number d
    where p in bag({iota(1, 10)})
      and c in bag({iota(1, 10)})
      and d = euclid(p, c);
§ Assign each point to the closest centroid: calc_cluster_assignment(…)
    groupby((p2c_distances1(…)), #'argminv');
§ Recalculate centroids: calc_clust_means(…)
    groupby(calc_cluster_assignment1(…), #'col_means');
Transitive closure
§ tclose is a second-order function to explore graphs where the edges are expressed by a transition function fno:
    tclose(Function fno, Object o) -> Bag of Object
§ fno(o) produces the children of o
§ tclose applies the transition function fno(o), then fno(fno(o)), then fno(fno(fno(o))), etc., until fno returns no new results
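The idea behind tclose can be mimicked in Python with a worklist. This is only a sketch of the concept, not AmosQL's implementation; in particular, including the start object o in the result is an assumption of this example.

```python
def tclose(fno, o):
    """Yield o and every object reachable from o by repeatedly applying fno.

    fno maps an object to an iterable of its children. Each object is
    yielded once, so the loop stops when fno produces no new results.
    """
    seen = {o}
    frontier = [o]
    while frontier:
        current = frontier.pop()
        yield current
        for child in fno(current):
            if child not in seen:
                seen.add(child)
                frontier.append(child)

# Hypothetical graph stored as an adjacency dictionary.
edges = {1: [2, 3], 2: [4], 3: [4], 4: []}
reachable = set(tclose(lambda n: edges[n], 1))
```

Tracking the objects already seen is what makes the iteration terminate on graphs with cycles.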
Iterate until convergence with tclose in AmosQL
create function bagidiv2(Bag of Number b) -> Bag of Number
  as (select floor(n/2)
      from Number n
      where n in b);
create function vecchild_idiv2(Vector of Number vb) -> Bag of Vector of Number
  as sort(bagidiv2(in(vb)));
create function vecconverge_tclose(Bag of Number ib) -> Bag of Vector of Number
  /* tclose function iterating the vecchild_idiv2 function until convergence */
  as select ov
     from Vector of Number ov
     where ov in tclose(#'vecchild_idiv2', sort(ib));
What about this?! [figure]
§ Non-spherical clusters
§ Noise
k-Means pros and cons
+ Easy
+ Fast
+ Scalable?
− Works only for "well shaped" clusters
− Sensitive to outliers
− Sensitive to noise
− Must know k a priori
Questions
§ Euclidean distance results in spherical clusters
  • What cluster shape does the Manhattan distance give?
  • Think of other distance measures too. What cluster shapes will those yield?
§ Assuming that the k-Means algorithm converges in I iterations, with N points and X features for each point
  • give an approximation of the complexity of the algorithm expressed in K, I, N, and X.
§ Can the k-Means algorithm be parallelized?
  • How?
DBSCAN
§ Density-Based Spatial Clustering of Applications with Noise
§ Basic idea:
  • If an object p is density connected to q,
    § then p and q belong to the same cluster
  • If an object is not density connected to any other object
    § it is considered noise
Definitions
§ ε-neighborhood
  • The ε-neighborhood of an object p is the set of objects within ε-distance of p
§ core object
  • An object q is a core object iff there are at least MinPts objects in q's ε-neighborhood
§ directly density reachable (ddr)
  • An object p is ddr from q iff q is a core object and p is inside the ε-neighborhood of q
Reachability and Connectivity
§ density reachable (dr)
  • An object p is dr from q iff there exists a chain of objects q_1 … q_n s.t. q_1 is ddr from q, q_2 is ddr from q_1, …, and p is ddr from q_n
§ density connected (dc)
  • p is dc to r iff there exists an object q such that p is dr from q and r is dr from q
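The definitions of ε-neighborhood, core object, ddr and dr translate almost directly into code. The sketch below is an illustration under stated assumptions: points are tuples, the distance is Euclidean, and an object counts as a member of its own ε-neighborhood (so MinPts includes the object itself). The function names are made up for this example.

```python
import math

def eps_neighborhood(points, p, eps):
    """All objects within eps of p (p itself included, an assumption)."""
    return [q for q in points if math.dist(p, q) <= eps]

def is_core(points, q, eps, min_pts):
    """q is a core object iff its eps-neighborhood holds at least min_pts objects."""
    return len(eps_neighborhood(points, q, eps)) >= min_pts

def ddr(points, q, eps, min_pts):
    """Objects directly density reachable from q (none if q is not core)."""
    return eps_neighborhood(points, q, eps) if is_core(points, q, eps, min_pts) else []

def density_reachable(points, q, eps, min_pts):
    """Objects density reachable from q, found by following chains of ddr steps."""
    seen, frontier = set(), [q]
    while frontier:
        current = frontier.pop()
        for p in ddr(points, current, eps, min_pts):
            if p not in seen:
                seen.add(p)
                frontier.append(p)
    return seen

# Points on a line: (1,0) and (2,0) are core objects, the others are not.
pts = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (10.0, 0.0)]
from_core = density_reachable(pts, (1.0, 0.0), eps=1.0, min_pts=3)
from_border = density_reachable(pts, (3.0, 0.0), eps=1.0, min_pts=3)
```

In this example (3,0) is density reachable from (1,0), but nothing is density reachable from (3,0) because it is not a core object, which shows that dr is not symmetric in general.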
Recall…
§ Basic idea:
  • If an object p is density connected to q,
    § then p and q belong to the same cluster
  • If an object is not density connected to any other object
    § it is considered noise
DBSCAN
i = 1
do
    take a point p from M
    find the set of points P which are density connected to p     % HOW?
    if P = {}
        M = M \ {p}
    else
        C_i = P
        i = i + 1
        M = M \ P
    end
while M ≠ {}
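One way to fill in the "HOW?" step is to grow a cluster from a core object by following ddr steps until nothing new is found. The Python sketch below follows the loop above under several assumptions that are not from the slides: points are distinct tuples, the distance is Euclidean, an object counts toward its own MinPts, a non-core point is set aside as tentative noise and reclaimed if a later cluster reaches it, and a border point reachable from two clusters stays with the first one.

```python
import math

def dbscan(M, eps, min_pts):
    """Return (clusters, noise) for a list of distinct points M."""
    def ddr(q):
        # objects directly density reachable from q (empty if q is not core)
        n = [x for x in M if math.dist(q, x) <= eps]
        return n if len(n) >= min_pts else []

    unassigned = list(M)
    assigned = set()
    clusters, noise = [], []
    while unassigned:
        p = unassigned[0]                      # take a point p from M
        cluster, frontier = set(), [p]
        while frontier:                        # follow ddr steps from p
            for x in ddr(frontier.pop()):
                if x not in cluster and x not in assigned:
                    cluster.add(x)
                    frontier.append(x)
        if not cluster:                        # p is not core: tentative noise
            noise.append(p)
            unassigned.remove(p)
        else:                                  # C_i = P, M = M \ P
            clusters.append(cluster)
            assigned |= cluster
            unassigned = [x for x in unassigned if x not in cluster]
            noise = [x for x in noise if x not in cluster]
    return clusters, noise

M = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0),
     (10.0, 0.0),
     (20.0, 0.0), (21.0, 0.0), (22.0, 0.0)]
clusters, noise = dbscan(M, eps=1.0, min_pts=3)
```

The neighborhood search here scans all of M for every query, which is quadratic overall; a spatial index would be the usual way to speed it up.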
Finding density connected components
§ If r is dc to p, there exists q s.t. both p and r are dr from q, i.e., there exists a ddr chain from q to both r and p, and q is a core object.
§ Recall: tclose is a second-order function to explore graphs where the edges are expressed by a transition function fno.
§ fno = ddr
Finding dc components in AmosQL
§ Assuming q is a core object and a ddr function with the following signature is defined:
    ddr(Integer q) -> Bag of Integer p
§ Then:
    create function dc(Integer q) -> Bag of Integer
      as select p
         from Integer p
         where p in tclose(#'ddr', q);
DBSCAN pros and cons
+ Clusters of arbitrary shape
+ Robust to noise
+ Does not need an a priori k
+ Deterministic
+ Scalable?
− Requires connected regions of sufficiently high density
− Data sets with varying densities are problematic
Questions
§ Why is the dc criterion useful to define a cluster, instead of dr or ddr?
§ For which points is density reachability symmetric, i.e., for which p, q do both dr(p, q) and dr(q, p) hold?
§ Express, using only core objects and ddr, which objects will belong to a cluster