HKU Department of Computer Science Database Research Seminar

18th May 2006
Density-Based Clustering of Uncertain Data (KDD 2005)
Authors: Hans-Peter Kriegel and Martin Pfeifle
Presenter: Chui Chun Kit (MPhil), ckchui@cs.hku.hk, http://www.cs.hku.hk/~ckchui
Supervisor: Dr. Benjamin C. M. Kao

Presentation Outline
• Introduction
  • What is clustering?
  • Density-based similarity measurement
  • DBSCAN
• Issues from mining certain data to uncertain data
  • Why do data exhibit uncertainty?
  • How to represent / model data uncertainty?
  • How to represent the distance between two uncertain objects?
• Theoretical foundation of changing DBSCAN to FDBSCAN
• From DBSCAN to FDBSCAN
• Computational issues
• Experimental results
• Conclusions

Introduction

What is Clustering?
• Problem description
  • A set of objects
  • A similarity measurement
• Discover groups of similar objects
• More precisely, find sets of objects whose intra-cluster similarity is high while inter-cluster similarity is relatively low.

Different Clusters Discovered by Different Similarity Measurements
• Distance-based
• Density-based
• Pattern-based
• …etc.

Density-based clustering
[Figure: objects plotted in the x-y plane. Any clusters?]
• The main reason why we recognize the clusters is that within each cluster we have a typical density of objects which is considerably higher than outside of the cluster.
• The clusters are separated by regions of low object density (noise).

Density-based clustering can detect arbitrary cluster shapes

Key idea of density-based clustering
• Density constraint for objects to form clusters
  • Intuitively, for each object of a cluster, the neighborhood of a given radius has to contain at least a minimum number of objects (the density constraint), i.e. the density in the neighborhood has to exceed some threshold.
• Objects that do not belong to any cluster are regarded as noise.

Previous Work on Density-Based Clustering
• DBSCAN
  • A density-based clustering algorithm
  • Works on data with no uncertainty
  • The uncertain-data version of DBSCAN is presented later

DBSCAN
• Two important definitions of DBSCAN
  • Core object
  • Directly density-reachable
  • Density-reachable (skip)
  • Density-connected (skip)
• For the sake of discussion, the last two definitions are skipped.

DBSCAN Definition 1: Core Object
• Given the density constraint (µ and ε)
• An object o is defined as a core object iff there are µ or more objects within the ε-range of o.
• In practice, we can conduct a range search on object o with radius ε; if µ or more objects are returned, then o is a core object.

DBSCAN Definition 1: Core Object
• Example (µ = 5): Is o1 a core object?
  • Since there are 5 objects within the ε-range of o1, o1 is a core object.
  • Since there are 5 objects within the ε-range of o2, o2 is a core object too.

DBSCAN Definition 2: Directly density-reachable
• An object p is directly density-reachable from o if the following conditions are satisfied:
  • 1st condition: o is a core object
  • 2nd condition: d(p, o) ≤ ε

DBSCAN Definition 2: Directly density-reachable
• Example (µ = 5). Question: Is o2 directly density-reachable from o1?
  • 1st condition: Is o1 a core object? Since there are 5 objects within the ε-range of o1, o1 is a core object.
  • 2nd condition: Is d(o2, o1) ≤ ε? Yes, o2 is within the ε-range of o1.
  • Thus, o2 is directly density-reachable from o1.

DBSCAN: How does it work? Brief idea…
• Search for clusters by checking the ε-neighborhood of each object in the database.
• If a core object o is found, a new cluster with o and its directly density-reachable objects is created.
• DBSCAN iteratively collects the directly density-reachable objects from the objects in the cluster.
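The brief idea above can be sketched in a few lines of Python. This is a minimal illustration on exact (certain) points, not the original authors' implementation; the function names `region_query` and `dbscan`, the use of Euclidean distance, and the convention that an object counts itself as one of its own ε-neighbors are assumptions made for the example.

```python
from math import dist

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (the point itself included)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def dbscan(points, eps, mu):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < mu:           # i is not a core object
            labels[i] = NOISE
            continue
        cluster += 1                       # i is a core object: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:         # former noise reached from a core object
                labels[j] = cluster
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= mu:    # j is a core object too: keep expanding
                queue.extend(j_neighbours)
    return labels
```

Objects that are reached only as non-core (border) objects join the cluster but do not expand it, which matches the example in which objects reachable only from a non-core object are not added.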

DBSCAN
• Example (µ = 5)
  • Arbitrarily pick a point, e.g. o1, and check if it is a core object…
  • Since o1 is a core object, a cluster with o1 and all of o1’s directly density-reachable objects is created.
  • a1 is a core object, so DBSCAN continues to “expand” the cluster by adding objects which are directly density-reachable from a1.
  • a2 is NOT a core object, so objects reachable from a2 are NOT added into the cluster.
  • Pick another point for the next iteration if the current cluster does not expand.
  • Eventually, clusters are formed. Objects that are not assigned to any cluster are regarded as noise.

From Certain Data to Uncertain Data

From certain to uncertain data
Five major issues…
• Why do data exhibit uncertainty?
• How to represent / model data uncertainty?
• How to represent the distance between two uncertain objects?
• What is a core object in uncertain data?
• What is directly density-reachable in uncertain data?

Why do data exhibit uncertainty?
• In many modern application areas, e.g. the clustering of moving objects or sensor databases, only uncertain data is available.
• For instance, in the area of mobile services, the objects continuously change their positions so that exact positional information is often not available.

Why do data exhibit uncertainty?
• In application areas such as clustering of distributed feature vectors, only approximate information is transmitted to a central server site, for security reasons or because of limited bandwidth.

Uncertain Data (Example)
• Somewhere in a tropical rain forest…
• Location tracking of a group of about 300 chimpanzees.
• An implanted device reports the location of each chimpanzee regularly.
• However, the reported location is not precise; it only gives the area in which the chimpanzee is located.
  • This area is called an uncertainty region.
  • Assume the chimpanzee is equally likely to be at any location inside the uncertainty region.

Uncertain Data (Example)
• Chimpanzee society is complicated; some young chimpanzees may band together to fight against the leader.
• Zoologists are interested in studying the factors that affect the formation of different groups (clusters) inside the chimpanzee society.

Uncertain Data (Example)
• One observation is that chimpanzees of the same group usually stay close together.
• Assume that each chimpanzee belongs to one group only.
• Density-based clustering can help to discover the chimpanzee groups (clusters).

Uncertain Data (Example)
[Figure: somewhere in the tropical rain forest… Uncertainty regions of 15 chimpanzees in the x-y plane, as reported by the location tracking devices, with the clusters marked.]

Representing Uncertain Objects
[Figure: probability density functions of 1-D objects, probability plotted against value (e.g. temperature).]
[Figure: probability density functions of 2-D objects in the x-y plane.]

Representing Uncertain Objects
[Figure: probability density function of a 1-D object over a value axis (e.g. temperature), with the area between a and b marked.]
• The probability that an object o has a value between a and b is the area under its probability density function between a and b, i.e. the integral of the pdf from a to b.
• Question: What is the distance between two uncertain objects o and o’?

How to represent the distance between uncertain objects?
• Distance density function pd(o, o’)
• Distance distribution function Pd(o, o’)(b)
• Distance expectation value Ed(o, o’)
  • Aggregated value
  • Information loss

Distance Density Function pd(o, o’)
• Expresses the distance between two objects by means of a probability density function.
• Let d be a distance function. Let P(a ≤ d(o, o’) ≤ b) denote the probability that d(o, o’) is between a and b. A probability density function pd(o, o’) is called a distance density function if the following condition holds: for all a ≤ b, P(a ≤ d(o, o’) ≤ b) equals the integral of pd(o, o’) from a to b.

Distance Density Function pd(o, o’)
• The probability density functions (pdfs) of the uncertain data items are considered independent.
• The distance density function expresses the distance between two uncertain objects by means of a pdf.
[Figure: pdfs of two uncertain objects over a value axis (e.g. temperature), and the resulting distance density function pd(o, o’), probability plotted against the distance between o and o’.]

Distance Density Function pd(o, o’)
[Figure: the distance density function pd(o, o’), which represents the distance between two uncertain objects; probability plotted against the distance between o and o’.]

Distance Density Function pd(o, o’)
• From the distance density function, the probability P(a ≤ d(o, o’) ≤ b) that the distance between two uncertain objects is between a and b is given by the area under pd(o, o’) between a and b.
• The total area under pd(o, o’), from the minimum possible distance to the maximum possible distance between o and o’, is 1.
[Figure: pd(o, o’) plotted as probability against the distance between o and o’, with a, b and the minimum and maximum possible distances marked.]

Distance Distribution Function
• Captures the probability that the distance between two uncertain objects is smaller than or equal to a value b.
• Useful in density-based clustering for expressing the probability that the 2nd condition for directly density-reachable in DBSCAN, d(o’, o) ≤ b, holds.

Distance Distribution Function
• In density-based clustering, when evaluating whether an object o’ is directly density-reachable from o, we may want to ask: what is the probability that o and o’ are close to each other, i.e. that the distance between o and o’ is smaller than or equal to b?
• The distance distribution function Pd(o, o’)(b) is the answer.
[Figure: probability density functions (pdfs) of o and o’.]

Distance Distribution Function
• The distance distribution function Pd(o, o’)(b) is equal to the integral of the distance density function pd(o, o’) from negative infinity to b.
[Figure: distance density function pd(o, o’), probability plotted against the distance between o and o’, with b marked on the distance axis.]
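When the integral has no convenient closed form, the distance distribution function can be estimated by sampling. The sketch below is an illustration under stated assumptions: each uncertain object is modelled as a uniform pdf over an axis-aligned box (as in the chimpanzee example), the two objects are independent, and the names `sample_uniform_box` and `distance_distribution` are hypothetical.

```python
import random
from math import dist

def sample_uniform_box(low, high, rng):
    """Draw one position uniformly from the box with corners low and high."""
    return tuple(rng.uniform(l, h) for l, h in zip(low, high))

def distance_distribution(box_o, box_p, b, n=10000, seed=0):
    """Estimate P(d(o, o') <= b) from n independent pairs of sampled positions.

    box_o and box_p are (low, high) corner pairs of the two uncertainty regions.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        x = sample_uniform_box(*box_o, rng)
        y = sample_uniform_box(*box_p, rng)
        if dist(x, y) <= b:
            hits += 1
    return hits / n
```

For two unit boxes that are far apart, the estimate is 0 for small b and rises to 1 once b exceeds the maximum possible distance between the boxes.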

Distance Expectation Value Ed(o, o’)
• Represents the distance between two uncertain objects by one numerical value: the average distance between the two objects, aggregated from the distance density function.
• Advantage: since the distance between two uncertain objects is represented by a single value, traditional clustering algorithms such as DBSCAN work.
• Disadvantage: information loss.
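For reference, the three representations can be written side by side. The notation follows the slides (p_d for the density, P_d for the distribution, E_d for the expectation); the formula for E_d is the standard definition of an expected value and is assumed here rather than copied from the slides.

```latex
\begin{align*}
P\bigl(a \le d(o,o') \le b\bigr) &= \int_a^b p_d(o,o')(x)\,dx \\
P_d(o,o')(b) &= \int_{-\infty}^{b} p_d(o,o')(x)\,dx \\
E_d(o,o') &= \int_{-\infty}^{\infty} x \, p_d(o,o')(x)\,dx
\end{align*}
```

The first two keep the whole distribution of the distance; the third collapses it to one number, which is where the information loss comes from.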

Theoretical Foundations I: Core Object Probability
• The core object probability of an object o is the probability that o is a core object.
• We derive its formula starting from the core object definition of DBSCAN…

Theoretical Foundations I: Core Object Probability
• In DBSCAN, an object o is a core object if the density constraint (µ and ε) is satisfied, i.e. there are µ or more objects p within the ε-range of o (d(p, o) ≤ ε).
• The probability that an object o is a core object is the probability that the density constraint is satisfied, i.e. the probability that there are µ or more objects p with d(p, o) ≤ ε.

Theoretical Foundations I: Core Object Probability
Example (µ = 5)
[Figure: object o with its ε-range, surrounded by the probability density functions (pdfs) of the other objects, including p.]
• If ε is this large, obviously, the core object probability of o is 1.
• If ε is this small, what is the core object probability of o? Sometimes d(p, o) ≤ ε and sometimes d(p, o) ≥ ε.

Theoretical Foundations I: Core Object Probability
• For each subset A of the database D with cardinality greater than or equal to µ:
  • Determine the probability that exactly the objects p of A, and no other objects in D \ A, have d(p, o) ≤ ε.
• Summing these probabilities over all such subsets A gives the core object probability.

Theoretical Foundations I: Core Object Probability
• Recall that the distance distribution function Pd(o, o’)(b) is the probability that the distance between two uncertain objects o and o’ is smaller than or equal to a value b.
• The probability that exactly the objects p of A, and no other objects in D \ A, have d(p, o) ≤ ε is the product of two parts:
  • First part: the probability that ALL objects p in A have d(p, o) ≤ ε
  • Second part: the probability that ALL objects p in D \ A do NOT have d(p, o) ≤ ε
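Putting the two parts together and summing over all admissible subsets gives the following expression. It is reconstructed from the verbal description above; the symbol P^core is an assumed name, and `\substack` needs the amsmath package.

```latex
\[
P^{\mathit{core}}(o) \;=\;
\sum_{\substack{A \subseteq D \\ |A| \ge \mu}}
\;\prod_{p \in A} P_d(p,o)(\varepsilon)
\;\prod_{p \in D \setminus A} \bigl(1 - P_d(p,o)(\varepsilon)\bigr)
\]
```

The first product is the "first part" (all objects of A are within ε of o) and the second product is the "second part" (no object of D \ A is), with independence between objects assumed.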

Theoretical Foundations II: Reachability Probability
• The reachability probability of p from o is the probability that p is directly density-reachable from o.
• In DBSCAN, an object p is directly density-reachable from o if
  • 1st condition: o is a core object
  • 2nd condition: d(p, o) ≤ ε
• Simply multiplying the probabilities of the two conditions is incorrect. Why? The two conditions are NOT independent of each other!

Theoretical Foundations II: Reachability Probability
Example (µ = 3)
[Figure: objects o, p and q with their probability density functions (pdfs) and the ε-range of o.]
• In this case, the probability that o is a core object depends on the probability that d(p, o) ≤ ε, i.e. the 1st and 2nd conditions are NOT independent.
• This is why multiplying the two probabilities directly is incorrect.

Theoretical Foundations II: Reachability Probability
• Two independent conditions:
  • 1st condition: we consider the core object probability of o in D \ {p}, and relax the density constraint µ by 1.
  • 2nd condition: we consider the probability that d(p, o) ≤ ε.
• Their product corresponds to the probability that at least µ objects o’ from D have d(o’, o) ≤ ε, and that object p is one of them, which corresponds to the definition of directly density-reachable in DBSCAN.

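Theoretical Foundations II: Reachability Probability
• The probability that at least µ-1 objects from D \ {p} are located within the ε-range of o is the core object probability of o computed on D \ {p} with density threshold µ-1.
• The probability that the distance between p and o is smaller than or equal to ε is the distance distribution function Pd(p, o)(ε).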

Theoretical Foundations II: Reachability Probability
• First factor: the probability that at least µ-1 objects from D \ {p} have a distance to o smaller than or equal to ε.
• Second factor: the probability that the distance between p and o is smaller than or equal to ε.
• The two conditions are independent, so their product corresponds to the probability that at least µ objects from D are located in the ε-range of o, and that p is one of them.
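In symbols, the product of the two independent factors can be written as follows. The notation is assumed for illustration: the subscripts on P^core indicate the database and the density threshold it is evaluated with.

```latex
\[
P^{\mathit{reach}}(p,o) \;=\;
P^{\mathit{core}}_{D \setminus \{p\},\,\mu-1}(o)
\;\cdot\;
P_d(p,o)(\varepsilon)
\]
```

Removing p from the database before evaluating the first factor is what makes the two factors independent.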

How does FDBSCAN work?
• The traditional DBSCAN algorithm clusters a data set by always adding to the current cluster the objects which are directly density-reachable from the current query object o.
• FDBSCAN works very similarly to the traditional approach.

How does FDBSCAN work?
• For each uncertain object o
  • Check if it is a core object
  • If yes, for each other object p
    • Check the reachability of p from o
    • If the reachability probability ≥ 0.5, p and o form a cluster
• There are O(|DB|²) reachability probability computations
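The loop above can be sketched as follows. This is a hedged illustration, not the authors' code: the probability estimators `core_prob` and `reach_prob` are hypothetical callables supplied by the caller, and the DBSCAN-style cluster expansion (only objects that pass the core test expand a cluster) is an assumption that goes slightly beyond the simplified loop on the slide.

```python
NOISE = -1

def fdbscan(objects, core_prob, reach_prob, threshold=0.5):
    """Label each object with a cluster id, or NOISE.

    core_prob(o) estimates the probability that o is a core object;
    reach_prob(p, o) estimates the probability that p is directly
    density-reachable from o. Both are supplied by the caller.
    """
    labels = {}
    next_cluster = 0
    for o in objects:
        if o in labels:
            continue
        if core_prob(o) < threshold:
            labels[o] = NOISE                 # may still be claimed by a cluster later
            continue
        labels[o] = next_cluster
        queue = [o]
        while queue:
            q = queue.pop()
            if core_prob(q) < threshold:
                continue                      # border objects do not expand the cluster
            for p in objects:
                if p == q or labels.get(p, NOISE) != NOISE:
                    continue                  # skip q itself and already clustered objects
                if reach_prob(p, q) >= threshold:
                    labels[p] = next_cluster
                    queue.append(p)
        next_cluster += 1
    return labels
```

With estimators that return only 0 or 1, this reduces to ordinary DBSCAN, which is a convenient sanity check.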

Computational Aspect I Computing the reachability probability

Computational Aspect I: Computing the reachability probability
• Reachability probability
  • Core object probability
    • Distance density function
      • Integration

Computational Aspect I: Computing the reachability probability
• Direction 1: Avoid calculating the integration
  • Sampling (Monte Carlo sampling)
    • Each uncertain object o is represented by a sequence of s sample points, i.e. <o1, o2, …, os>
    • Compute the reachability probability based on the sample sequences.
    • How can it be done? (If time allows)

Computational Aspect I: Computing the reachability probability
• Direction 2: Reduce the number of reachability probability computations.
  • Some objects may be located very far away from o, so they obviously have no chance of being directly density-reachable from o.
  • Use MBRs (minimum bounding rectangles) to bound the object samples
    • Compute, for all objects o, the MBR(o) bounding the sample points <o1, o2, …, os>
    • If MBR(p) is outside the ε-range of o, p can NOT be directly density-reachable from o.
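A minimal sketch of this pruning test follows. It assumes Euclidean distance and interprets "MBR(p) is outside the ε-range of o" as "the smallest possible distance between MBR(o) and MBR(p) exceeds ε"; the function names are hypothetical.

```python
from math import sqrt

def mbr(samples):
    """Minimum bounding rectangle of a sample sequence, as (low corner, high corner)."""
    dims = range(len(samples[0]))
    low = tuple(min(s[k] for s in samples) for k in dims)
    high = tuple(max(s[k] for s in samples) for k in dims)
    return low, high

def min_dist(box_a, box_b):
    """Smallest Euclidean distance between any point of box_a and any point of box_b."""
    (alo, ahi), (blo, bhi) = box_a, box_b
    gaps = (max(bl - ah, al - bh, 0.0) for al, ah, bl, bh in zip(alo, ahi, blo, bhi))
    return sqrt(sum(g * g for g in gaps))

def can_prune(samples_o, samples_p, eps):
    """True if no sample of p can lie within eps of any sample of o."""
    return min_dist(mbr(samples_o), mbr(samples_p)) > eps
```

When `can_prune` returns True, every sample pair is farther apart than ε, so the reachability probability estimated from the samples is 0 and need not be computed.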

Computational Aspect II (if time allows): Computing Core Object Probability
Interesting, but complicated.

Computational Aspect II: Computing Core Object Probability
• Two issues
  • 1st issue: There are many core object probability computations.
  • 2nd issue: In each core object probability computation, we have to consider exponentially many (in |DB|) subsets A of DB.

1st Issue: Many Core Object Probability Computations
• For each uncertain object o
  • Check the probability that o is a core object
  • If the core object probability ≥ 0.5, for each other object p
    • Check the reachability of p from o
    • If the reachability probability ≥ 0.5, p and o form a cluster
• The 1st condition of the reachability probability is itself a core object probability, computed for all p in D.

2nd Issue: Exponentially many subsets to consider for each core-object value
• Furthermore, the computation of core-object values has to consider exponentially many (in |DB|) subsets A of DB: all subsets A of D with cardinality greater than or equal to µ.
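To make the cost concrete, here is a brute-force sketch that enumerates every subset A with |A| ≥ µ. It assumes the per-object probabilities P(d(p, o) ≤ ε) are already known and independent; the function name is hypothetical, and whether o itself is included in the list (with probability 1.0) is left to the caller.

```python
from itertools import combinations
from math import prod

def core_object_probability_bruteforce(neighbour_probs, mu):
    """Sum, over all subsets A with |A| >= mu, of the probability that exactly
    the objects in A are within eps of o.

    neighbour_probs[i] is P(d(p_i, o) <= eps) for object p_i.
    """
    n = len(neighbour_probs)
    total = 0.0
    for size in range(mu, n + 1):
        for subset in combinations(range(n), size):
            inside = set(subset)
            total += prod(
                q if i in inside else 1.0 - q
                for i, q in enumerate(neighbour_probs)
            )
    return total
```

The number of subsets visited grows exponentially with the number of objects, which is why the slides turn to sampling instead.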

2nd Issue: Exponentially many subsets to consider for each core-object value
• Sampling (Monte Carlo sampling)
  • Each uncertain object o is represented by a sequence of s sample points, i.e. <o1, o2, …, os>
  • Compute the core object probability based on the sample sequences.
  • How can it be done?

Compute based on the sample sequences
• s is the sample rate; each object o is represented by <o1, o2, …, os>.
• Determine the core object probability based on s² meaningful samples.
  • oj is called the j-th instance of o.
  • Dj is the collection of the j-th instances of all objects in D.
• E.g. s = 5
  • a1, a2, a3, a4, a5
  • b1, b2, b3, b4, b5
  • c1, c2, c3, c4, c5
  • d1, d2, d3, d4, d5
  • e1, e2, e3, e4, e5
  • D1 = {a1, b1, c1, d1, e1}
  • D2 = {a2, b2, c2, d2, e2}
  • …

Compute based on the sample sequences
• If we want to compute the core object probability of o, create an s×s sample matrix M(o)
  • M(o) keeps track of the information for deducing the core object probability of o
  • With some modification, it can also be used to deduce the reachability probability
  • Each cell mi,j of M(o) indicates the number of ε-neighbors of oi in Dj.

Create sample matrix M(o) (skip)
• Each cell mi,j contains the number of ε-neighbors of object sample oi in database instance Dj.
• Dj consists of the j-th samples of all other objects (excluding oj)
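A direct, unpruned way to fill the matrix is sketched below. It assumes Euclidean distance and, following the worked example, starts every cell at 1 so that o counts itself; the function name and argument layout are hypothetical, and the MBR pruning step is left out for brevity.

```python
from math import dist

def sample_matrix(o_samples, other_samples, eps):
    """Build the s x s matrix M(o).

    o_samples: the s sample points of the query object o.
    other_samples: one list of s sample points per other object.
    m[i][j] = 1 + number of other objects whose j-th sample is within eps of o_i.
    """
    s = len(o_samples)
    m = [[1] * s for _ in range(s)]          # every cell starts at 1: o counts itself
    for i, o_i in enumerate(o_samples):
        for j in range(s):
            for p_samples in other_samples:
                if dist(o_i, p_samples[j]) <= eps:
                    m[i][j] += 1
    return m
```

Row i therefore describes sample oi, and column j describes database instance Dj.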

Create sample matrix M(o) (Example: sample rate = 3, µ = 5)
• o is the query object
• All object samples are bounded by MBRs
[Figure: MBRs of objects a, b, c and d around the query object o with samples o1, o2 and o3.]

Create sample matrix M(o) (Example: sample rate = 3, µ = 5)
• Build M(o): we are going to fill m1,1. How many ε-neighbors of o1 are in D1?
  • Since o1 itself is also counted, m1,1 is initialized to 1.
  • By min-max MBR distance pruning, we are sure these three objects contain ε-neighbors of o1 in D1, so the count becomes 4.
  • MBR(a) and MBR(b) cannot be pruned, so we retrieve their sample sequences.
  • a1 and b1 are ε-neighbors of o1, so the count becomes 6.
  • Although b2 is an ε-neighbor of o1, it is not counted as it is NOT in database instance 1.
  • 6 is the final value. This indicates that there are 6 ε-neighbors of object sample o1 in database instance D1.

Create sample matrix M(o) (Example: sample rate = 3, µ = 5)
• Now we have the sample matrix M(o) (rows: instances of o; columns: database instances):

          D1  D2  D3
    o1     6   5   5
    o2     6   4   5
    o3     4   4   5

Compute based on the sample matrix M(o) (µ = 5)
• For each uncertain object o
  • Check the probability that o is a core object
  • If the core object probability ≥ 0.5, for each other object p
    • Check the reachability of p from o
    • If the reachability probability ≥ 0.5, p and o form a cluster

Compute the core object probability based on the sample matrix M(o) (µ = 5)
• 1st step: Count the number of elements in the sample matrix M(o) which contain values higher than or equal to µ.
• 2nd step: Normalizing the count by s² yields the core object probability.

          D1  D2  D3
    o1     6   5   5
    o2     6   4   5
    o3     4   4   5

• 1st step: Count = 6
• 2nd step: Core object probability of o = 6/9
• Since the core object probability is > 0.5, o is treated as a core object.
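The two steps amount to a count and a division. A small sketch, using the example matrix from the slides (the function name is hypothetical):

```python
def core_probability_from_matrix(matrix, mu):
    """Fraction of cells of the s x s sample matrix whose value is at least mu."""
    s = len(matrix)
    count = sum(1 for row in matrix for cell in row if cell >= mu)
    return count / (s * s)

# Sample matrix M(o) from the example (rows: o1..o3, columns: D1..D3).
M = [
    [6, 5, 5],
    [6, 4, 5],
    [4, 4, 5],
]
probability = core_probability_from_matrix(M, mu=5)   # 6 of the 9 cells are >= 5
is_core = probability >= 0.5
```

Here six cells reach the threshold, so the estimate is 6/9 and o passes the 0.5 test.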

Compute the reachability probability based on the sample matrix M(o) (µ = 5)
• For each uncertain object o
  • Check the probability that o is a core object
  • If the core object probability ≥ 0.5, for each other object p
    • Check the reachability of p from o
    • If the reachability probability ≥ 0.5, p and o form a cluster
• The reachability probability is the product of two parts:
  • The first part can be derived from M(o)
  • The second part can use the object samples’ MBRs for pruning

Compute the first part
• 1st step: Decrease by 1 the values mi,j for which d(oi, pj) ≤ ε holds.
• 2nd step: Count the number of elements in the sample matrix M(o) which contain values higher than or equal to µ-1.
• 3rd step: Normalizing the count by s² yields the probability.

Computing the first part (p = a)
• Conceptually, M(o) contains the ε-neighbor information in D; we want it to contain the information in D \ {a}.
• 1st step: decrease the values mi,j by 1 for which d(oi, aj) ≤ ε holds. Here m1,1, m1,3, m2,1, m2,3 and m3,3 are each decreased by 1:

          D1       D2   D3
    o1    6 -> 5    5   5 -> 4
    o2    6 -> 5    4   5 -> 4
    o3    4         4   5 -> 4

Computing the first part (p = a)
• The sample matrix after the 1st step:

          D1  D2  D3
    o1     5   5   4
    o2     5   4   4
    o3     4   4   4

• 2nd step: Count the number of elements in the sample matrix M(o) which contain values higher than or equal to µ-1.
• 3rd step: Since all the cells are greater than or equal to 5-1 = 4, the first part probability is equal to 9/9 = 1.

Compute the second part
• Count the number of events d(oi, pj) ≤ ε and normalize the count by s×s.
• The MBRs of the object samples can be used for pruning.

Computing the second part (p = a)
• 1st step: Count the number of events d(oi, aj) ≤ ε. Count = 2 + 2 + 1 = 5
• 2nd step: Normalize the count by s². The probability that d(a, o) ≤ ε is 5/9.

Reachability of a from o = 1 × 5/9 = 5/9
• Since 5/9 ≥ 0.5, a is directly density-reachable from o.
• a and o form a cluster.
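Both parts can be computed from the sample matrix and a table of which sample pairs are close. The sketch below replays the worked example; the function name and the boolean table `close` (True where d(oi, aj) ≤ ε) are assumptions for illustration, with the True cells taken from the cells that were decreased in the example.

```python
def reachability_probability(matrix, close, mu):
    """Estimate the reachability probability of p from o.

    matrix: the s x s sample matrix M(o).
    close[i][j]: True when d(o_i, p_j) <= eps.
    """
    s = len(matrix)
    # First part: remove p's contribution, then test against the relaxed threshold mu - 1.
    reduced = [
        [matrix[i][j] - (1 if close[i][j] else 0) for j in range(s)]
        for i in range(s)
    ]
    first = sum(1 for row in reduced for cell in row if cell >= mu - 1) / (s * s)
    # Second part: fraction of sample pairs (o_i, p_j) that are within eps.
    second = sum(1 for row in close for flag in row if flag) / (s * s)
    return first * second

M = [[6, 5, 5], [6, 4, 5], [4, 4, 5]]
close_a = [
    [True, False, True],
    [True, False, True],
    [False, False, True],
]
reach_a = reachability_probability(M, close_a, mu=5)   # 9/9 * 5/9
```

Since the result is at least 0.5, a is treated as directly density-reachable from o in this example.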

Reachability of other objects from o
[Figure: the sample matrix M(o) and the object samples of a, b, c and d around o.]

Experimental Evaluation

Experimental Evaluation
• Datasets
  • Artificial data set (ART)
    • 1000 2-dimensional objects which are normally distributed in [0, 1]
    • Each object is randomly surrounded by a box having a side length of p < 1 in each dimension (data fuzziness)
    • Assume a uniform probability distribution within the box
  • Engineering data set (PLANE)
    • 5000 42-dimensional objects
    • Normalized

Experimental Evaluation
• Implementation
  • FDBSCAN
  • EXPDBSCAN
    • Represents the distance between two uncertain objects by a single distance expectation value Ed(o, o’).
    • Uses the traditional DBSCAN algorithm to mine the data.

Experimental Evaluation
• Implementation
  • Java 1.4
  • Windows platform
  • 730 MHz processor
  • 512 MB main memory
  • Sample rate s = 5

Experiment 1: Efficiency of FDBSCAN
• Measure the runtimes of FDBSCAN and EXPDBSCAN on the ART dataset
[Figure: runtime (s) of FDBSCAN and EXPDBSCAN.]
• p = 0.01: little fuzziness in the dataset
• Does EXPDBSCAN apply the same MBR pruning strategies as FDBSCAN?

Experiment 2: Effectiveness of FDBSCAN
• Measures the relation between the quality of the cluster results and the data fuzziness, for FDBSCAN and EXPDBSCAN.
• How to measure the quality of clusters?
  • Treat it as a black box for the time being…
  • A good clustering has a quality value close to 1, and vice versa

Experiment 2: Effectiveness of FDBSCAN
• FDBSCAN returns better clusters than EXPDBSCAN for all levels of data fuzziness and all numbers of dimensions, i.e. FDBSCAN is more effective than EXPDBSCAN.
• The quality of both EXPDBSCAN and FDBSCAN falls with high data fuzziness; however, the quality of FDBSCAN falls less than that of EXPDBSCAN.
• In ART, EXPDBSCAN performs quite well, but for high-dimensional data, its quality is much worse than that of the FDBSCAN approach.

Experiment 3: Accuracy of the core object classification
• How accurately do FDBSCAN and EXPDBSCAN classify core objects?
• Precision and recall rate of core objects
  • Precision shows how precise the reported set of core objects is.
    • # of reported real core objects / # of core objects reported
  • Recall shows the percentage of real core objects reported.
    • # of reported real core objects / total # of real core objects in D
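The two ratios can be computed directly from the set of reported core objects and the set of real core objects. A small sketch (the function name and the convention of returning 0.0 for an empty denominator are assumptions):

```python
def precision_recall(reported, real):
    """Precision and recall of a reported set of core objects against the real set."""
    reported, real = set(reported), set(real)
    hits = len(reported & real)               # reported objects that are real core objects
    precision = hits / len(reported) if reported else 0.0
    recall = hits / len(real) if real else 0.0
    return precision, recall
```

For example, reporting {1, 2, 3, 4} when the real core objects are {2, 3, 5} gives a precision of 2/4 and a recall of 2/3.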

Experiment 3: Accuracy of the core object classification
• FDBSCAN has a higher precision and recall rate of core objects in the 2-D ART dataset.
• The precision and recall rate are, however, not 100%, because FDBSCAN uses a sampling approach for calculating the core object probability.
• The precision and recall rate of FDBSCAN increase in high dimensions. Why?
• For EXPDBSCAN, nearly all of the returned core objects are real core objects, but very few real core objects are found.
• EXPDBSCAN has a lower recall rate than FDBSCAN. Why?

Why does EXPDBSCAN suffer from a low recall rate? (Example µ = 5)
[Figure: ten numbered uncertain objects, each with a Gaussian probability density function; A and B are core point candidates.]

Why does EXPDBSCAN suffer from a low recall rate? (Example µ = 5)
• Number of ε-neighbors of A = 5, so A is a core object.
• Number of ε-neighbors of B = 4, so B is NOT a core object.

Conclusion
• Demonstrated how density-based clustering can be carried out based on uncertain information.
• Presented theoretical foundations for density-based clustering of uncertain data.
• FDBSCAN works on the fuzzy distance function directly instead of working on lossy aggregated information.

My comments
• We also want to know…
  • The relationship between the sample rate and the execution time. A higher sample rate should give a more accurate result, but generally it trades off against execution time. What is the relationship between these two parameters?
  • Sample rate vs cluster quality
  • Sample rate vs data dimensionality, which is a reference for determining the sample rate based on the data characteristics
  • Sample rate vs fuzziness of data
    • Since we represent each uncertain object by an MBR, MBR(o) bounds the samples of o
    • This means that MBR(o) may not bound the whole uncertainty region of o
    • With high data fuzziness, MBR(o) may not precisely indicate the uncertainty region of the real object o.

Something confusing…
• We also want to know…
  • The reason for using a probability of 0.5 to determine core objects is questionable.
  • Why not treat this as a parameter?
  • A higher value should lead to more false negative core objects; a lower value should lead to more false positive core objects.

The End Thank you