Probability Densities in Data Mining
Andrew W. Moore, Professor, School of Computer Science, Carnegie Mellon University
www.cs.cmu.edu/~awm | awm@cs.cmu.edu | 412-268-7599
Copyright © Andrew W. Moore
Slide 1
Contents
• Why they are important.
• Notation and fundamentals of continuous PDFs.
• Multivariate continuous PDFs.
• Combining discrete and continuous random variables.
Slide 2
Why are they important?
• Real numbers occur in at least 50% of database records.
• We can't always quantize them.
• So we need to understand how to describe where they come from.
• PDFs are a great way of saying what a reasonable range of values is.
• PDFs are a great way of saying how multiple attributes should reasonably co-occur.
Slide 3
Why are they important?
• They immediately give us Bayes classifiers that are sensible with real-valued data.
• You'll need to understand PDFs intimately in order to do kernel methods, clustering with mixture models, analysis of variance, time series and many other things.
• They will introduce us to linear and non-linear regression.
Slide 4
A PDF of American Ages in 2000 (figure) Slide 5
Population of Puerto Rico (PR) by age group

group    freq     midpoint   rel. freq.
0-4      284593     2.5      0.0737516
5-9      301424     7.5      0.0781133
10-14    305025    12.5      0.0790465
15-19    305577    17.5      0.0791895
20-24    299362    22.5      0.0775789
25-29    277415    27.5      0.0718914
30-34    262959    32.5      0.0681452
35-39    265154    37.5      0.0687140
40-44    258211    42.5      0.0669147
45-49    239965    47.5      0.0621863
50-54    233597    52.5      0.0605361
55-59    206552    57.5      0.0535274
60-64    169796    62.5      0.0440022
65-69    141869    67.5      0.0367650
70-74    112416    72.5      0.0291323
75-79     85137    77.5      0.0220630
80-84     57953    82.5      0.0150184
85+       51801    87.5      0.0134241
Slide 6
A PDF of American Ages in 2000
Let X be a continuous random variable. If p(x) is a Probability Density Function for X then
$$P(a < X \le b) = \int_{x=a}^{b} p(x)\,dx$$
For the age density in the figure, the example interval on the slide has probability 0.36.
Slide 8
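Properties of PDFs
$$P(a < X \le b) = \int_{x=a}^{b} p(x)\,dx$$
That means…
$$p(x) = \lim_{h \to 0} \frac{P(x - h/2 < X \le x + h/2)}{h}$$
Slide 9
By the mean value theorem for integrals,
$$P(x - h/2 < X \le x + h/2) = \int_{x-h/2}^{x+h/2} p(t)\,dt = h\,p(w), \quad \text{where } x - h/2 < w < x + h/2.$$
Then $P(x - h/2 < X \le x + h/2)/h = p(w)$, and so p(w) tends to p(x) as h tends to zero (for continuous p).
Slide 10
Cumulative distribution function:
$$F(x) = P(X \le x) = \int_{-\infty}^{x} p(t)\,dt$$
This is a nondecreasing function. Note that
$$\frac{F(x + h/2) - F(x - h/2)}{h} = p(w) \to p(x) \quad \text{as } h \to 0.$$
This shows that the derivative of the distribution function gives the density function: $\frac{d}{dx} F(x) = p(x)$.
Slide 11
Properties of PDFs
$$\frac{\partial}{\partial x} P(X \le x) = p(x)$$
Therefore…
$$\forall x:\; p(x) \ge 0$$
because the derivative of a nondecreasing function is greater than or equal to zero.
Slide 12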
• What is the meaning of p(x)? If p(5.31) = 0.06 and p(5.92) = 0.03, then when a value of X is sampled from the distribution, X is twice as likely to fall very close to 5.31 as it is to fall very close to 5.92.
Slide 13
Yet another way to view a PDF
A recipe for sampling a random age:
1. Generate a random dot from the rectangle surrounding the PDF curve. Call the dot (age, d).
2. If d < p(age), stop and return age.
3. Else try again: go to Step 1.
Slide 14
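The recipe above is rejection sampling, and it can be sketched in a few lines of R. This is only an illustration under stated assumptions: the age density from the slides is not available here, so a triangular density on [0, 2] stands in for it, and the names `p`, `sample_one`, `lo`, `hi` and `p_max` are hypothetical.

```r
# Stand-in density (assumption): triangular on [0, 2] with peak height 1 at x = 1.
p <- function(x) ifelse(x >= 0 & x <= 2, 1 - abs(x - 1), 0)

# Rejection sampler following the three steps of the recipe.
sample_one <- function(p, lo, hi, p_max) {
  repeat {
    x <- runif(1, lo, hi)     # Step 1: random dot (x, d) inside the rectangle
    d <- runif(1, 0, p_max)
    if (d < p(x)) return(x)   # Step 2: accept when the dot lies under the curve
  }                           # Step 3: otherwise loop and try again
}

set.seed(1)
draws <- replicate(5000, sample_one(p, lo = 0, hi = 2, p_max = 1))
mean(draws)   # should be close to 1, the mean of this stand-in density
```

The rectangle must cover the whole curve, so `p_max` has to be at least the maximum of p on [lo, hi]; a looser rectangle still works but rejects more dots.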
Test your understanding • True or False: Slide 15
Expectations
$$E[X] = \int_{x=-\infty}^{\infty} x\,p(x)\,dx$$
E[X] = the expected value of random variable X
= the average value we'd see if we took a very large number of random samples of X
= the first moment of the shape formed by the axes and the blue curve
= the best value to choose if you must guess an unknown person's age and you'll be fined the square of your error
For the age example, E[age] = 35.897.
Slide 17
Expectation of a function
$$\mu = E[f(X)] = \int_{x=-\infty}^{\infty} f(x)\,p(x)\,dx$$
E[f(X)] = the expected value of f(x) where x is drawn from X's distribution
= the average value we'd see if we took a very large number of random samples of f(X)
Note that in general:
$$E[f(X)] \ne f(E[X])$$
Slide 18
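A quick numerical check of the remark that E[f(X)] and f(E[X]) generally differ. The choice of X ~ Uniform(0, 1) and f(x) = x² is an assumption made purely for illustration; for that choice the exact values are 1/3 and 1/4.

```r
set.seed(1)
x <- runif(100000)         # assumed distribution: X ~ Uniform(0, 1)
f <- function(x) x^2       # assumed function f

e_fx <- mean(f(x))         # Monte Carlo estimate of E[f(X)]; exact value 1/3
f_ex <- f(mean(x))         # f applied to the estimate of E[X]; exact value 1/4
c(E_f_of_X = e_fx, f_of_E_X = f_ex)
```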
Variance
$$\sigma^2 = \mathrm{Var}[X] = \int_{x=-\infty}^{\infty} (x - \mu)^2\,p(x)\,dx$$
σ² = Var[X] = the expected squared difference between x and E[X]
= the amount you'd expect to lose if you must guess an unknown person's age and you'll be fined the square of your error, assuming you play optimally
Slide 19
Standard Deviation
$$\sigma = \sqrt{\mathrm{Var}[X]}$$
σ = standard deviation = the "typical" deviation of X from its mean
Slide 20
Statistics for Puerto Rico (PR)
• E(age) = 35.17
• Var(age) = 501.16
• Std. dev.(age) = 22.38
Slide 21
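These summary figures can be approximately recomputed from the age-group table given earlier in the deck by treating each group as concentrated at its midpoint. The sketch below assumes exactly that, including the table's midpoint of 87.5 for the open-ended 85+ group; the variable names are illustrative.

```r
# Frequencies from the age-group table (groups 0-4, 5-9, ..., 80-84, 85+).
freq <- c(284593, 301424, 305025, 305577, 299362, 277415, 262959, 265154,
          258211, 239965, 233597, 206552, 169796, 141869, 112416, 85137,
          57953, 51801)
midpoint <- seq(2.5, 87.5, by = 5)   # midpoints as listed in the table

rel <- freq / sum(freq)              # relative frequencies
m <- sum(midpoint * rel)             # grouped-data estimate of E(age)
v <- sum((midpoint - m)^2 * rel)     # grouped-data estimate of Var(age)
s <- sqrt(v)                         # standard deviation
c(mean = m, var = v, sd = s)
```

The results should land close to the values on the slide; small differences in the last digit come from rounding.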
Best-known density functions
• The uniform or rectangular density
• The triangular density
• The exponential density
• The Gamma and Chi-square densities
• The Beta density
• The Normal or Gaussian density
• The t and F densities
Slide 23
The rectangular distribution (figure: constant height 1/w on the interval from -w/2 to w/2)
$$p(x) = \begin{cases} 1/w & \text{if } |x| \le w/2 \\ 0 & \text{otherwise} \end{cases}$$
Slide 24
The triangular distribution (figure) Slide 25
The Exponential distribution (figure) Slide 26
The Standard Normal distribution
$$p(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right)$$
Slide 27
The General Normal distribution (figure: μ = 100, σ = 15)
$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$
Slide 28
General Gaussian
Also known as the normal distribution or bell-shaped curve (figure: μ = 100, σ = 15).
Shorthand: we say X ~ N(μ, σ²) to mean "X is distributed as a Gaussian with parameters μ and σ²". In the above figure, X ~ N(100, 15²).
Slide 29
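The Error Function
Assume X ~ N(0, 1). Define ERF(x) = P(X < x) = cumulative distribution of X:
$$\mathrm{ERF}(x) = \int_{z=-\infty}^{x} p(z)\,dz = \frac{1}{\sqrt{2\pi}} \int_{z=-\infty}^{x} \exp\left(-\frac{z^2}{2}\right) dz$$
Note that ERF as defined here is the standard normal CDF, often written Φ(x); it is not the conventionally scaled erf found in most math libraries.
Slide 30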
Using The Error Function
Assume X ~ N(μ, σ²). Then
$$P(X < x \mid \mu, \sigma^2) = \mathrm{ERF}\left(\frac{x - \mu}{\sigma}\right)$$
Slide 31
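In R, the ERF of these slides (the standard normal CDF) corresponds to `pnorm`. The sketch below checks the standardization identity for the N(100, 15²) example used earlier; the query point x = 115 and the helper name `erf_slide` are assumptions for illustration.

```r
# ERF as defined on the slides: the CDF of N(0, 1).
erf_slide <- function(x) pnorm(x)

mu <- 100; sigma <- 15   # parameters from the General Gaussian slide
x <- 115                 # illustrative query point (assumption)

via_erf <- erf_slide((x - mu) / sigma)      # standardize, then use ERF
direct  <- pnorm(x, mean = mu, sd = sigma)  # R's normal CDF with parameters
c(via_erf = via_erf, direct = direct)
```

Both routes give the probability that X falls below one standard deviation above the mean.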
The Central Limit Theorem
• If (X1, X2, …, Xn) are i.i.d. continuous random variables,
• then define $z = f(x_1, x_2, \ldots, x_n) = \frac{1}{n}\sum_{i=1}^{n} x_i$.
• As n → ∞, p(z) → Gaussian with mean E[Xi] and variance Var[Xi]/n.
• This is somewhat of a justification for the common assumption of Gaussian noise.
Slide 32
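A small simulation can make the theorem concrete. The sketch assumes z is the sample mean and draws the Xi from an Exp(1) distribution (mean 1, variance 1), which is an arbitrary non-Gaussian choice; n = 50 and 20000 replications are likewise assumptions.

```r
set.seed(1)
n <- 50                                          # sample size per average
z <- replicate(20000, mean(rexp(n, rate = 1)))   # 20000 sample means of Exp(1)

c(mean_z = mean(z), var_z = var(z))   # compare with E[Xi] = 1 and Var[Xi]/n = 0.02
# hist(z, freq = FALSE); curve(dnorm(x, 1, sqrt(1 / n)), add = TRUE)  # optional visual check
```

Even though Exp(1) is strongly skewed, the histogram of z should already look close to the Gaussian overlay at this n.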
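Density function estimators
• Histograms: $\hat f(x) = \frac{\text{number of observations in the class containing } x}{n\,h}$, where h is the class width.
• K-nearest neighbors: $\hat f(x) = \frac{k}{2\,n\,d_k(x)}$, where $d_k(x)$ is the distance from x to its k-th nearest neighbor.
• Kernel density estimators: $\hat f(x) = \frac{1}{n\,h} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)$.
Slide 33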
> x = c(7.3, 6.8, 7.1, 2.5, 7.9, 6.5, 4.2, 0.5, 5.6, 5.9)
> hist(x, freq = FALSE, main = "Density function estimation: histogram")
> rug(x, col = 2)
Slide 34
Density estimation by kNN at 20 points with k = 1, 3, 5, 7 (figure) Slide 35
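A minimal sketch of a univariate k-nearest-neighbor density estimator, assuming the common form k / (2 n d_k(x)), where d_k(x) is the distance from x to its k-th nearest sample point. The data behind the figure are not available, so the 10-point sample from the histogram example is reused; the evaluation grid and the function name `knn_density` are assumptions.

```r
# kNN density estimate at each point of x0, given sample x and neighbor count k.
knn_density <- function(x0, x, k) {
  n <- length(x)
  sapply(x0, function(t) {
    d_k <- sort(abs(x - t))[k]   # distance to the k-th nearest sample point
    k / (2 * n * d_k)
  })
}

x <- c(7.3, 6.8, 7.1, 2.5, 7.9, 6.5, 4.2, 0.5, 5.6, 5.9)  # sample from the histogram slide
grid <- seq(0, 9, length.out = 20)                        # evaluation grid (assumption)
est <- knn_density(grid, x, k = 3)
at6 <- knn_density(6, x, k = 3)   # the third-nearest point to 6 is 6.5, so d_k = 0.5
```

With k = 1 the estimate blows up at the sample points themselves (d_k = 0), which is one reason larger k gives smoother curves.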
Kernel estimation of a univariate density function.
In the univariate case, the kernel estimator of the density function f(x) is obtained as follows. Suppose x1, …, xn is a random sample of a random variable X with density function f(x), and define the empirical distribution function by
$$F_n(x) = \frac{\#\{x_i \le x\}}{n},$$
which is an estimator of the cumulative distribution function F(x) of X. Since the density function f(x) is the derivative of the distribution function F, a finite-difference approximation to the derivative gives
$$\hat f(x) = \frac{F_n(x + h) - F_n(x - h)}{2h}$$
Slide 36
where h is a positive value close to zero. This is equivalent to the proportion of points in the interval (x - h, x + h) divided by 2h. The previous equation can be written as
$$\hat f(x) = \frac{1}{n\,h} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)$$
where the weight function K is defined by
$$K(z) = \begin{cases} 0 & \text{if } |z| > 1 \\ 1/2 & \text{if } |z| \le 1 \end{cases}$$
Slide 37
Sample: 6, 8, 9, 12, 20, 25, 18, 31 (figure) Slide 38
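The uniform-kernel estimator can be tried on this sample. The sketch assumes the estimator has the form (1/(n h)) Σ K((x − xi)/h) with K(z) = 1/2 for |z| ≤ 1 and 0 otherwise; the bandwidth h = 5, the query point 10 and the names `K_unif` and `kde` are illustrative assumptions.

```r
# Uniform kernel: weight 1/2 inside [-1, 1], zero outside.
K_unif <- function(z) ifelse(abs(z) <= 1, 0.5, 0)

# Generic kernel density estimate at the points x0.
kde <- function(x0, x, h, K) {
  sapply(x0, function(t) sum(K((t - x) / h)) / (length(x) * h))
}

x <- c(6, 8, 9, 12, 20, 25, 18, 31)     # sample from the slide
f10 <- kde(10, x, h = 5, K = K_unif)    # estimate at x = 10 with h = 5 (assumed)

# Same number via "proportion of points in (x - h, x + h), divided by 2h":
prop <- mean(x > 10 - 5 & x < 10 + 5) / (2 * 5)
c(f10 = f10, prop = prop)
```

Four of the eight points (6, 8, 9, 12) fall inside (5, 15), so both expressions give (4/8)/10 = 0.05.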
This K is called the uniform kernel, and h is called the bandwidth: a smoothing parameter that indicates how much each sample point contributes to the estimate at the point x. In general, K and h must satisfy certain regularity conditions, such as:
• K(z) must be bounded and absolutely integrable on (-∞, ∞).
• Usually, but not always, K(z) ≥ 0 and K is symmetric, so any symmetric density function can be used as a kernel.
Slide 39
Choice of the bandwidth h
Here n is the number of data points and s is the sample standard deviation.
Slide 40
THE GAUSSIAN KERNEL
In this case the kernel is a smoother weight function in which all the points contribute to the estimate of f(x) at x. That is,
$$K(z) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{z^2}{2}\right)$$
Slide 41
THE TRIANGULAR KERNEL
$$K(z) = \begin{cases} 1 - |z| & \text{for } |z| < 1 \\ 0 & \text{otherwise} \end{cases}$$
THE "BIWEIGHT" KERNEL
$$K(z) = \begin{cases} \frac{15}{16}(1 - z^2)^2 & \text{for } |z| < 1 \\ 0 & \text{otherwise} \end{cases}$$
Slide 42
THE EPANECHNIKOV KERNEL
$$K(z) = \begin{cases} \frac{3}{4\sqrt{5}}\left(1 - \frac{z^2}{5}\right) & \text{for } |z| < \sqrt{5} \\ 0 & \text{otherwise} \end{cases}$$
Slide 43
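Each of these kernels is itself a density, which can be checked numerically. The sketch below integrates the Gaussian, triangular, biweight and Epanechnikov kernels with base R's `integrate`. The Epanechnikov formula is partly lost in the slide, so the scaled form with support |z| < √5 is an assumption here; the function names are illustrative.

```r
K_gauss <- function(z) dnorm(z)
K_tri   <- function(z) ifelse(abs(z) < 1, 1 - abs(z), 0)
K_biw   <- function(z) ifelse(abs(z) < 1, 15 / 16 * (1 - z^2)^2, 0)
# Assumed form: Epanechnikov kernel scaled to have support (-sqrt(5), sqrt(5)).
K_epa   <- function(z) ifelse(abs(z) < sqrt(5), 3 / (4 * sqrt(5)) * (1 - z^2 / 5), 0)

areas <- c(
  gauss = integrate(K_gauss, -Inf, Inf)$value,
  tri   = integrate(K_tri, -1, 1)$value,
  biw   = integrate(K_biw, -1, 1)$value,
  epa   = integrate(K_epa, -sqrt(5), sqrt(5))$value
)
areas   # each entry should be approximately 1
```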
Density estimation at 20 points using a Gaussian kernel with h = 0.5, "opt1", "opt2", 4 (figure) Slide 44
Bidimensional random variables
p(x, y) = probability density of random variables (X, Y) at location (x, y)
Slide 45
Bidimensional density function estimators
• Histograms: $\hat f(x, y) = \frac{\text{number of observations in the class containing } (x, y)}{n\,A}$, where A is the area of the class.
• K-nearest neighbors: $\hat f(x, y) = \frac{k}{n\,A_k}$, where $A_k$ is the area that includes up to the k-th nearest neighbor.
• Kernel density estimators.
Slide 46
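Bivariate kernel estimation
Let $\mathbf{x}_i = (x_i, y_i)$ be the observed values and $\mathbf{t} = (t_1, t_2)$ a point of the plane where the joint density is to be estimated:
$$\hat f(\mathbf{t}) = \frac{1}{n\,h_1 h_2} \sum_{i=1}^{n} K\left(\frac{t_1 - x_i}{h_1}, \frac{t_2 - y_i}{h_2}\right)$$
If h1 = h2 = h:
$$\hat f(\mathbf{t}) = \frac{1}{n\,h^2} \sum_{i=1}^{n} K\left(\frac{\mathbf{t} - \mathbf{x}_i}{h}\right)$$
Slide 47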
• (a1, a2) H⁻¹, where … Slide 48
Density estimation: bivariate Gaussian kernel (figure) Slide 49
> library(MASS)  # provides kde2d
> f1 = kde2d(autompg1$V1, autompg1$V5, n = 100)
> persp(f1$x, f1$y, f1$z)
Slide 50
> contour(f1, levels = c(8e-6, 2e-5, 2.8e-5), col = c(2, 3, 4), xlab = "mpg", ylab = "weight")
Slide 51
In 2 dimensions
Let X, Y be a pair of continuous random variables, and let R be some region of (X, Y) space…
$$P((X, Y) \in R) = \iint_{(x, y) \in R} p(x, y)\,dy\,dx$$
Slide 52
In 2 dimensions
Let X, Y be a pair of continuous random variables, and let R be some region of (X, Y) space…
P(20 < mpg < 30 and 2500 < weight < 3000) = volume under the 2-d surface within the red rectangle
Slide 53
In 2 dimensions
Let X, Y be a pair of continuous random variables, and let R be some region of (X, Y) space…
P([(mpg - 25)/10]² + [(weight - 3300)/1500]² < 1) = volume under the 2-d surface within the red oval
Slide 54
In 2 dimensions
Let X, Y be a pair of continuous random variables, and let R be some region of (X, Y) space…
Take the special case of region R = "everywhere". Remember that with probability 1, (X, Y) will be drawn from "somewhere". So…
$$\int_{x=-\infty}^{\infty} \int_{y=-\infty}^{\infty} p(x, y)\,dy\,dx = 1$$
Slide 55
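Probabilities of regions are volumes under the joint density, which can be approximated by a Riemann sum on a grid. The autompg data are not available here, so the sketch assumes an illustrative joint density (two independent standard normals) and an arbitrary rectangle; all names are hypothetical.

```r
# Illustrative joint density (assumption): independent N(0,1) x N(0,1).
p_xy <- function(x, y) dnorm(x) * dnorm(y)

gx <- seq(-6, 6, length.out = 601)
gy <- seq(-6, 6, length.out = 601)
dx <- gx[2] - gx[1]
dy <- gy[2] - gy[1]
z <- outer(gx, gy, p_xy)            # z[i, j] = p(gx[i], gy[j])

total <- sum(z) * dx * dy           # volume under the whole surface

# Region R: the rectangle -1 < x < 1, 0 < y < 2 (0/1 indicator matrix).
in_R <- outer(gx > -1 & gx < 1, gy > 0 & gy < 2)
prob_R <- sum(z * in_R) * dx * dy   # approximate P((X, Y) in R)

exact_R <- (pnorm(1) - pnorm(-1)) * (pnorm(2) - pnorm(0))
c(total = total, prob_R = prob_R, exact_R = exact_R)
```

The same pattern (density matrix times cell area, summed over the cells inside R) applies to a `kde2d` result, or to an oval region by changing the indicator.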
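In m dimensions
Let (X1, X2, …, Xm) be an m-tuple of continuous random variables, and let R be some region of ℝ^m…
$$P((X_1, X_2, \ldots, X_m) \in R) = \int \cdots \int_{(x_1, \ldots, x_m) \in R} p(x_1, x_2, \ldots, x_m)\,dx_m \cdots dx_2\,dx_1$$
Slide 57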
Independence
If X and Y are independent then knowing the value of X does not help predict the value of Y.
(figure) mpg and weight are NOT independent.
Slide 58
Independence
If X and Y are independent then knowing the value of X does not help predict the value of Y.
(figure) The contours say that acceleration and weight are independent.
Slide 59
Multivariate Expectation
E[mpg, weight] = (24.5, 2600), the centroid of the cloud.
Slide 60
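Multivariate Expectation
> f1 = kde2d(autompg1$mpg, autompg1$weight, n = 100)
> dx = f1$x[2] - f1$x[1]
> dy = f1$y[2] - f1$y[1]
> dx
[1] 0.379798
> dy
[1] 35.62626
> # f1$z[i, j] is the density at (f1$x[i], f1$y[j]); f1$x recycles down the rows
> meanmpg = sum(f1$x * f1$z) * dx * dy
> meanmpg
[1] 22.48855
> # y indexes the columns of f1$z, so transpose before weighting by f1$y
> # (the slide's printed 2848.638 came from sum(f1$y * f1$z), which pairs y with rows)
> meanweight = sum(f1$y * t(f1$z)) * dx * dy
> # sample means, for comparison
> mean(autompg1$weight)
[1] 2977.584
> mean(autompg1$mpg)
[1] 23.44592
Slide 61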
Multivariate Expectation Slide 62
Test your understanding
• All the time? Always.
• Only when X and Y are independent?
• It can fail even if X and Y are independent?
Slide 63
Bivariate Expectation Slide 64
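Bivariate Covariance
$$\sigma_{xy} = \mathrm{Cov}[X, Y] = E[(X - \mu_x)(Y - \mu_y)]$$
$$\sigma_x^2 = \mathrm{Cov}[X, X] = \mathrm{Var}[X], \qquad \sigma_y^2 = \mathrm{Cov}[Y, Y] = \mathrm{Var}[Y]$$
$$\Sigma = \begin{pmatrix} \sigma_x^2 & \sigma_{xy} \\ \sigma_{xy} & \sigma_y^2 \end{pmatrix}$$
Slide 65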
Estimated covariance and standard deviation of mpg and weight
> cov(autompg1[, c(1, 5)])
               mpg      weight
mpg       60.91814   -5517.441
weight -5517.44070  721484.709
> sd(autompg1$mpg)
[1] 7.805007
> sd(autompg1$weight)
[1] 849.4026
Slide 67
Covariance Intuition
(figure) E[mpg, weight] = (24.5, 2600), with the principal eigenvector of Σ drawn through it.
Slide 69
Regression Line
Notice that the regression line passes through (x̄, ȳ).
Slide 70
Regression Line
> l1 = lm(weight ~ mpg, data = autompg1)
> l1
Call:
lm(formula = weight ~ mpg, data = autompg1)
Coefficients:
(Intercept)          mpg
    5101.11       -90.57
> # slope of the regression line = cov(mpg, weight) / var(mpg)
> slope = -5517.44 / 60.918
> slope
[1] -90.571
Slide 71
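The two facts used above, that the least-squares slope equals cov(x, y)/var(x) and that the fitted line passes through (x̄, ȳ), can be checked on any data set. Since autompg1 is not available here, the sketch below uses synthetic data that only loosely mimics mpg and weight; every number in it is an assumption.

```r
set.seed(1)
x <- rnorm(200, mean = 25, sd = 8)          # synthetic stand-in for mpg
y <- 5000 - 90 * x + rnorm(200, sd = 300)   # synthetic stand-in for weight

fit <- lm(y ~ x)
slope_lm  <- unname(coef(fit)["x"])   # slope reported by lm
slope_cov <- cov(x, y) / var(x)       # slope from the covariance matrix entries

# Prediction at the mean of x should reproduce the mean of y.
at_mean <- unname(predict(fit, newdata = data.frame(x = mean(x))))
c(slope_lm = slope_lm, slope_cov = slope_cov, at_mean = at_mean, ybar = mean(y))
```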
First principal component
> a = cov(autompg1[, c(1, 5)])
> eigen(a)
$values
[1] 721526.90386     18.72329
$vectors
             [,1]        [,2]
[1,] -0.007647317 0.999970759
[2,]  0.999970759 0.007647317
> # slope of the first principal component
> .99997 / -.00764
[1] -130.8861
Slide 72
Covariance Fun Facts
• True or False: If σxy = 0 then X and Y are independent. False
• True or False: If X and Y are independent then σxy = 0. True
• True or False: If σxy = σx σy then X and Y are deterministically related. True
• True or False: If X and Y are deterministically related then σxy = σx σy. False
How could you prove or disprove these?
Slide 73
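One way to disprove the first and last statements is a small counterexample. The sketch below assumes X takes the values -1, 0, 1 with equal probability and Y = X²: Y is a deterministic function of X, yet the covariance is zero.

```r
x <- c(-1, 0, 1)   # X takes each value with probability 1/3 (assumed example)
y <- x^2           # Y is deterministically related to X

# Population covariance: E[XY] - E[X] E[Y]
cov_xy <- mean(x * y) - mean(x) * mean(y)

# Dependence check: P(Y = 1) versus P(Y = 1 | X = 1)
p_y1          <- mean(y == 1)
p_y1_given_x1 <- mean(y[x == 1] == 1)
c(cov_xy = cov_xy, p_y1 = p_y1, p_y1_given_x1 = p_y1_given_x1)
```

Zero covariance with P(Y = 1 | X = 1) different from P(Y = 1) shows that σxy = 0 does not imply independence; and since σx σy > 0 here, a deterministic relation does not force σxy = σx σy.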
Test your understanding
• All the time?
• Only when X and Y are independent? True.
• It can fail even if X and Y are independent?
Slide 74
Marginal Distributions
$$p(x) = \int_{y=-\infty}^{\infty} p(x, y)\,dy$$
Slide 75
Conditional Distributions
p(x | y) = the p.d.f. of X when Y = y:
$$p(x \mid y) = \frac{p(x, y)}{p(y)}$$
Why?
Slide 77
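Independence Revisited
$$X \perp Y \iff \forall x, y:\; p(x, y) = p(x)\,p(y)$$
It's easy to prove that these statements are equivalent…
$$\forall x, y:\; p(x, y) = p(x)\,p(y) \iff \forall x, y:\; p(x \mid y) = p(x) \iff \forall x, y:\; p(y \mid x) = p(y)$$
Slide 78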
More useful stuff
(These can all be proved from definitions on previous slides.)
Bayes Rule:
$$p(x \mid y) = \frac{p(y \mid x)\,p(x)}{p(y)}$$
Slide 79
Mixing discrete and continuous variables: Bayes Rule Slide 80
P(education, salary > 50k) (figure) Slide 81
Estimation of the posterior P(Class | Education) (figure) Slide 82
Mixing discrete and continuous variables
(figure panels) P(EduYears, Wealthy); P(Wealthy | EduYears); P(EduYears | Wealthy), with renormalized axes.
Slide 85
Exercises
• Suppose X and Y are independent real-valued random variables distributed between 0 and 1:
  • What is p[min(X, Y)]?
  • What is E[min(X, Y)]?
• Prove that E[X] is the value u that minimizes E[(X - u)²].
• What is the value u that minimizes E[|X - u|]?
Slide 86