Distributions cont.: Continuous and Multivariate
Distribution, numeric attribute
- Continuous data potentially has an infinite domain
  - probability of specific values is zero
  - probabilities over intervals, e.g. (-∞, x]
- Cumulative distribution function (CDF)
  - F_X(x) = P(X ≤ x)
- Probability density function (PDF)
  - first derivative of the CDF
  - relative density of points for each value
  - density is not probability
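
The relationship between CDF, PDF and interval probabilities can be sketched in a few lines. This is only an illustration: the normal distribution with mean 0 and standard deviation 0.1 is an assumed example, and `interval_probability` is a hypothetical helper name, neither of which comes from the slides.

```python
from statistics import NormalDist

# Assumed example distribution (not from the slides): normal, mean 0, sigma 0.1.
X = NormalDist(mu=0.0, sigma=0.1)

def interval_probability(dist, lo, hi):
    """P(lo < X <= hi), computed from the CDF as F(hi) - F(lo)."""
    return dist.cdf(hi) - dist.cdf(lo)

# Probability over the interval (-inf, 0] is simply the CDF at 0.
p_left_tail = X.cdf(0.0)

# Probability over a finite interval: one sigma on each side of the mean.
p_interval = interval_probability(X, -0.1, 0.1)

# A single specific value is an interval of width zero, so its probability is 0.
p_point = interval_probability(X, 0.05, 0.05)

# The density at the mean is well above 1: density is not probability.
density_at_mean = X.pdf(0.0)
```

Because the spread is narrow, the density at the mean exceeds 1 even though every probability stays within [0, 1].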
Histograms
- Estimate density in a discrete way
- Define cut points and count occurrences within bins
- How to choose cut points
  - equal width: cut the domain (min to max) into k equal-size intervals
  - equal height: select cut points such that each of the k bins contains (approximately) n/k data points
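
The two ways of choosing cut points can be sketched as follows. The function names and the toy data are hypothetical, and the sketch assumes that k bins are separated by k - 1 interior cut points, with a value equal to a cut point falling into the upper bin.

```python
from bisect import bisect_right

def equal_width_cuts(values, k):
    """k - 1 interior cut points splitting [min, max] into k equal-size intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]

def equal_height_cuts(values, k):
    """k - 1 interior cut points so that each bin holds about n/k data points."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // k] for i in range(1, k)]

def bin_counts(values, cuts):
    """Count occurrences per bin; bin i covers [cuts[i-1], cuts[i])."""
    counts = [0] * (len(cuts) + 1)
    for v in values:
        counts[bisect_right(cuts, v)] += 1
    return counts

# Hypothetical skewed toy data, n = 12.
values = [1, 2, 2, 3, 3, 4, 5, 8, 13, 21, 34, 55]

width_counts = bin_counts(values, equal_width_cuts(values, 4))
height_counts = bin_counts(values, equal_height_cuts(values, 4))
```

On skewed data like this, equal-width bins pile most points into the first bin, while equal-height bins stay balanced at the cost of very unequal bin widths. With many tied values, equal-height bins can only be approximately balanced.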
Kernel Density Estimation
- Estimating the density (of the population) from the sample
- Observed data is smoothed over the numeric domain by means of a kernel (often Gaussian)
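
A minimal sketch of kernel density estimation with a Gaussian kernel: each observed point contributes a small bump, and the estimate is the average of the bumps. The sample values and the bandwidth of 0.5 are assumptions chosen by hand for illustration; the slides do not say how the bandwidth is selected.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used as the smoothing kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(sample, bandwidth):
    """Return a function x -> estimated density at x."""
    n = len(sample)

    def density(x):
        # Average of kernels centred on each observation, scaled by the bandwidth.
        total = sum(gaussian_kernel((x - xi) / bandwidth) for xi in sample)
        return total / (n * bandwidth)

    return density

# Hypothetical toy sample with two clusters.
sample = [1.0, 1.2, 1.9, 2.1, 2.2, 4.8, 5.0]
density = kde(sample, bandwidth=0.5)  # bandwidth chosen by hand

dense_region = density(2.0)   # close to several observations
sparse_region = density(3.5)  # in the gap between the clusters
```

A smaller bandwidth follows the sample more closely and gives a bumpier estimate; a larger one smooths more aggressively.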
Entropy of continuous attribute
- Differential entropy
- Generalisation of entropy to the continuous case is somewhat problematic
- Uniform distribution over [0, a]: H(X) = lg(a)
  - a = ½ => H(X) = lg(½) = -1 ?
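
The uniform example can be checked directly. The sketch below uses the closed form lg(a) for the differential entropy of a uniform distribution over [0, a]; the function name is hypothetical.

```python
import math

def uniform_differential_entropy(a):
    """Differential entropy in bits of the uniform distribution over [0, a].

    The density is 1/a on [0, a], so
    -integral over [0, a] of (1/a) * lg(1/a) dx = lg(a).
    """
    return math.log2(a)

h_wide = uniform_differential_entropy(2.0)    # lg(2)  =  1
h_unit = uniform_differential_entropy(1.0)    # lg(1)  =  0
h_narrow = uniform_differential_entropy(0.5)  # lg(1/2) = -1
```

The negative value for a = ½ is what makes the generalisation awkward: the entropy of a discrete attribute is never negative, whereas differential entropy drops below zero as soon as the density exceeds 1 over the whole domain.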
Multivariate Distributions
Joint distributions
- How frequent are combinations of values?
- Confusion matrix (contingency table, cross table)
  - counts each combination
  - complete information

Joint distribution of X and Y, with marginal totals:

         T      F
  T    0.42   0.13   0.55
  F    0.12   0.33   0.45
       0.54   0.46   1.0

- The margins give the univariate distribution of each attribute (marginal distribution), e.g. of X
- 2 attributes: how informative is one attribute about the other?
- Quantifying information between attributes: joint entropy, mutual information, information gain, …
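
As an illustration of how the marginals and the information measures follow from the joint table, here is a small sketch using the probabilities shown above. The function names are hypothetical, and the keys are (row value, column value) pairs; which attribute sits on which axis is left open here because the flattened slide does not make it clear.

```python
import math

# Joint probabilities from the table above; keys are (row value, column value).
joint = {("T", "T"): 0.42, ("T", "F"): 0.13,
         ("F", "T"): 0.12, ("F", "F"): 0.33}

def marginals(joint):
    """Sum the joint table along each axis to get the two univariate distributions."""
    rows, cols = {}, {}
    for (r, c), p in joint.items():
        rows[r] = rows.get(r, 0.0) + p
        cols[c] = cols.get(c, 0.0) + p
    return rows, cols

def entropy(probs):
    """Entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(joint):
    """I = H(rows) + H(columns) - H(joint)."""
    rows, cols = marginals(joint)
    return entropy(rows.values()) + entropy(cols.values()) - entropy(joint.values())

rows, cols = marginals(joint)
joint_entropy = entropy(joint.values())
mi = mutual_information(joint)
```

A mutual information above zero means that knowing one attribute reduces the uncertainty about the other; for this table it is positive but well below one bit.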
Some joint distributions
- X and Y are independent: each cell is the product of its marginals
  - 0.48 = 0.6 × 0.8, 0.12 = 0.6 × 0.2, 0.32 = 0.4 × 0.8, 0.08 = 0.4 × 0.2

         T      F
  T    0.48   0.32   0.8
  F    0.12   0.08   0.2
       0.6    0.4    1.0

- Y depends on X
  - higher counts along diagonal
  - both diagonals possible

         T      F
  T    0.42   0.13   0.55
  F    0.12   0.33   0.45
       0.54   0.46   1.0

- X fully determines Y

         T      F
  T    0.4    0      0.4
  F    0      0.6    0.6
       0.4    0.6    1.0
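
The three cases can be told apart mechanically. The sketch below stores each table as a list of rows (inner cells only, without the marginal totals); the function names are hypothetical, and the tolerance is an assumption needed for floating-point comparison.

```python
def is_independent(joint, tol=1e-9):
    """True if every cell equals the product of its row and column marginals."""
    row = [sum(r) for r in joint]
    col = [sum(c) for c in zip(*joint)]
    return all(abs(joint[i][j] - row[i] * col[j]) <= tol
               for i in range(len(row)) for j in range(len(col)))

def is_determined_by_rows(joint):
    """True if each row has at most one non-zero cell,
    so the row value fixes the column value."""
    return all(sum(1 for p in r if p > 0) <= 1 for r in joint)

# The three tables from this slide, inner cells only.
independent = [[0.48, 0.32], [0.12, 0.08]]
dependent = [[0.42, 0.13], [0.12, 0.33]]
deterministic = [[0.4, 0.0], [0.0, 0.6]]

checks = {
    name: (is_independent(t), is_determined_by_rows(t))
    for name, t in [("independent", independent),
                    ("dependent", dependent),
                    ("deterministic", deterministic)]
}
```

Only the first table passes the independence check, and only the last one is fully determined; the middle table sits in between, with most of the mass on one diagonal.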
Capturing multivariate continuous distributions
- 2 dimensions
- Problematic in higher dimensions
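
One way to capture a two-dimensional continuous distribution is a grid histogram. The sketch below is an assumption about what such a capture could look like, not the method shown on the slide, and the function names and toy points are hypothetical. It also illustrates one common reason why the approach degrades in higher dimensions: with k bins per dimension the grid has k^d cells, so a fixed number of points spreads ever thinner.

```python
def histogram_2d(points, k, lo, hi):
    """Count points in a k x k grid of equal-width cells over [lo, hi] x [lo, hi]."""
    width = (hi - lo) / k
    counts = [[0] * k for _ in range(k)]
    for x, y in points:
        i = min(int((x - lo) / width), k - 1)  # clamp x == hi into the last cell
        j = min(int((y - lo) / width), k - 1)  # clamp y == hi into the last cell
        counts[i][j] += 1
    return counts

def average_points_per_cell(n, k, d):
    """Average occupancy when n points fall into a grid of k**d cells."""
    return n / k ** d

# Hypothetical toy points in the unit square.
points = [(0.1, 0.2), (0.15, 0.25), (0.8, 0.9), (1.0, 1.0)]
counts = histogram_2d(points, k=2, lo=0.0, hi=1.0)

per_cell_2d = average_points_per_cell(1000, 10, 2)  # 100 cells
per_cell_6d = average_points_per_cell(1000, 10, 6)  # one million cells
```

With 1000 points and 10 bins per dimension, two dimensions still leave several points per cell on average, while six dimensions leave almost every cell empty.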
Joint distribution over numeric × binary
- Of specific relevance in Data Mining
  - classification
- How does the class (T/F) depend on a numeric attribute?
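
A simple way to look at this dependence is to bin the numeric attribute and compute the fraction of class T in each bin. The following is a sketch under assumptions: the data, the cut points and the function name are all hypothetical.

```python
from bisect import bisect_right

def class_rate_per_bin(values, labels, cuts):
    """For each bin of the numeric attribute, the fraction of points with class 'T'.

    Bin i covers [cuts[i-1], cuts[i]); an empty bin yields None.
    """
    totals = [0] * (len(cuts) + 1)
    positives = [0] * (len(cuts) + 1)
    for v, label in zip(values, labels):
        b = bisect_right(cuts, v)
        totals[b] += 1
        if label == "T":
            positives[b] += 1
    return [p / t if t else None for p, t in zip(positives, totals)]

# Hypothetical toy data: the class tends towards T for larger values.
values = [1.0, 1.5, 2.0, 3.0, 3.5, 4.0, 5.0, 5.5, 6.0]
labels = ["F", "F", "F", "F", "T", "F", "T", "T", "T"]

rates = class_rate_per_bin(values, labels, cuts=[2.5, 4.5])
```

A rate that changes strongly from bin to bin suggests the numeric attribute is informative about the class, which is exactly what a classifier tries to exploit.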