Probability Densities in Data Mining
Andrew W. Moore
Professor, School of Computer Science, Carnegie Mellon University
www.cs.cmu.edu/~awm
awm@cs.cmu.edu
412-268-7599
Copyright © Andrew W. Moore

Contents
• Why they are important
• Notation and fundamentals of continuous PDFs
• Multivariate PDFs

Why are they important?
• Real numbers occur in at least 50% of database records
• Can’t always quantize them
• So we need to understand how to describe where they come from
• A great way of saying what’s a reasonable range of values
• A great way of saying how multiple attributes should reasonably co-occur

Why are they important?
• Can immediately get us Bayes Classifiers that are sensible with real-valued data
• You’ll need to intimately understand PDFs in order to do kernel methods, clustering with Mixture Models, analysis of variance, time series and many other things
• Will introduce us to linear and non-linear regression

A PDF of American Ages in 2000

Population of PR by age group

group    freq     midpoint   rel. freq.
0-4      284593     2.5      0.0737516
5-9      301424     7.5      0.0781133
10-14    305025    12.5      0.0790465
15-19    305577    17.5      0.0791895
20-24    299362    22.5      0.0775789
25-29    277415    27.5      0.0718914
30-34    262959    32.5      0.0681452
35-39    265154    37.5      0.0687140
40-44    258211    42.5      0.0669147
45-49    239965    47.5      0.0621863
50-54    233597    52.5      0.0605361
55-59    206552    57.5      0.0535274
60-64    169796    62.5      0.0440022
65-69    141869    67.5      0.0367650
70-74    112416    72.5      0.0291323
75-79     85137    77.5      0.0220630
80-84     57953    82.5      0.0150184
85+       51801    87.5      0.0134241

A PDF of American Ages in 2000
Let X be a continuous random variable.

Properties of PDFs
That means…
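A standard way to write the property this slide points to, sketched here in generic notation (the symbols a, b and h are illustrative and not taken from the slide):

```latex
P(a < X \le b) = \int_a^b p(x)\,dx
\quad\Longrightarrow\quad
p(x) = \lim_{h \to 0} \frac{P\left(x - \tfrac{h}{2} < X \le x + \tfrac{h}{2}\right)}{h}
```

Read this way, p(x) is a probability per unit length rather than a probability, which is why a density value can exceed 1.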

(where x - h/2 < w < x + h/2). Then … Thus p(w) tends to p(x) as h tends to zero.

Cumulative distribution function
This is a non-decreasing function. Note that p(w) tends to p(x) as h tends to 0. This shows that the derivative of the distribution function gives the density function.
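The argument can be written out as a short derivation. This is a sketch that assumes p is continuous at x; F is an assumed name for the cumulative distribution function, and w is the intermediate point supplied by the mean value theorem for integrals:

```latex
F(x) = P(X \le x) = \int_{-\infty}^{x} p(u)\,du
\\
\frac{F(x + h/2) - F(x - h/2)}{h}
  = \frac{1}{h}\int_{x-h/2}^{x+h/2} p(u)\,du
  = p(w), \qquad x - h/2 < w < x + h/2
\\
F'(x) = \lim_{h \to 0} p(w) = p(x)
```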

Properties of PDFs
Therefore…
The derivative of a function

• What is the meaning of p(x)? If p(5.31) = 0.06 and p(5.92) = 0.03, then when a value of X is sampled from the distribution, X is twice as likely to fall very close to 5.31 as to fall very close to 5.92.

Yet another way to view a PDF
A recipe for sampling a random age.
1. Generate a random dot from the rectangle surrounding the PDF curve. Call the dot (age, d).
2. If d < p(age), stop and return age.
3. Else try again: go to Step 1.
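The recipe above is rejection sampling. A minimal Python sketch follows; because the age curve itself is not available here, `triangular_pdf` is a hypothetical stand-in density on [-1, 1], and the function and parameter names are illustrative assumptions rather than anything from the slides.

```python
import random


def rejection_sample(pdf, lo, hi, pdf_max, rng):
    """Draw one value from `pdf` using the dot-in-rectangle recipe.

    The rectangle spans [lo, hi] horizontally and [0, pdf_max] vertically,
    so pdf_max must be at least the maximum of pdf on [lo, hi].
    """
    while True:
        x = rng.uniform(lo, hi)        # Step 1: horizontal coordinate of the dot
        d = rng.uniform(0.0, pdf_max)  # Step 1: vertical coordinate of the dot
        if d < pdf(x):                 # Step 2: dot is under the curve, accept
            return x
        # Step 3: otherwise loop and try again


def triangular_pdf(x):
    """Hypothetical stand-in density: triangle on [-1, 1] peaking at 0."""
    return max(0.0, 1.0 - abs(x))


rng = random.Random(0)
samples = [rejection_sample(triangular_pdf, -1.0, 1.0, 1.0, rng)
           for _ in range(20000)]
sample_mean = sum(samples) / len(samples)
sample_var = sum((s - sample_mean) ** 2 for s in samples) / len(samples)
print(sample_mean, sample_var)  # theory for this triangle: mean 0, variance 1/6
```

The tighter the rectangle fits the curve, the fewer dots are rejected; here half of the dots are accepted on average because the triangle fills half of the rectangle.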

Test your understanding
• True or False:

Expectations
E[X] = the expected value of random variable X
= the average value we’d see if we took a very large number of random samples of X
= the first moment of the shape formed by the axes and the blue curve
= the best value to choose if you must guess an unknown person’s age and you’ll be fined the square of your error
E[age] = 35.897

Expectation of a function
E[f(X)] = the expected value of f(x) where x is drawn from X’s distribution
= the average value we’d see if we took a very large number of random samples of f(X)
Note that in general: $E[f(X)] \neq f(E[X])$
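A small worked example of why the order of f and E matters. The choices of X uniform on [0, 1] and f(x) = x^2 are illustrative assumptions, not taken from the slides:

```latex
E[f(X)] = \int_0^1 x^2\,dx = \tfrac{1}{3},
\qquad
f(E[X]) = \left(\int_0^1 x\,dx\right)^2 = \left(\tfrac{1}{2}\right)^2 = \tfrac{1}{4}
```

The two quantities are guaranteed to agree when f is linear, since E[aX + b] = aE[X] + b.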

Variance
$\sigma^2$ = Var[X] = the expected squared difference between x and E[X]
= the amount you’d expect to lose if you must guess an unknown person’s age and you’ll be fined the square of your error, assuming you play optimally

Standard Deviation
$\sigma^2$ = Var[X] = the expected squared difference between x and E[X]
= the amount you’d expect to lose if you must guess an unknown person’s age and you’ll be fined the square of your error, assuming you play optimally
$\sigma = \sqrt{\mathrm{Var}[X]}$ = Standard Deviation = “typical” deviation of X from its mean
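Why guessing E[X] is the optimal play under a squared-error fine can be seen from a one-line decomposition. Here u stands for an arbitrary guess (an illustrative symbol):

```latex
E[(X - u)^2] = E[(X - E[X])^2] + (E[X] - u)^2 = \mathrm{Var}[X] + (E[X] - u)^2
```

The second term vanishes only at u = E[X], which leaves an expected fine of exactly Var[X]. Expanding the square also gives the computational shortcut Var[X] = E[X^2] - (E[X])^2.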

Statistics for PR
• E(age) = 35.17
• Var(age) = 501.16
• Std. dev.(age) = 22.38
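These figures can be approximately reproduced from the age-group table by treating every person in a group as sitting at the group midpoint. The sketch below is one way to do that in Python; the midpoint of 87.5 for the open-ended 85+ group is taken from the table, and the variable names are illustrative.

```python
import math

# Frequencies per 5-year age group, copied from the age-group table (0-4 ... 85+)
freqs = [284593, 301424, 305025, 305577, 299362, 277415, 262959, 265154,
         258211, 239965, 233597, 206552, 169796, 141869, 112416, 85137,
         57953, 51801]
midpoints = [2.5 + 5 * i for i in range(len(freqs))]  # 2.5, 7.5, ..., 87.5

total = sum(freqs)
rel_freq = [f / total for f in freqs]  # discrete approximation to the PDF

# E[X] and Var[X] of the grouped distribution
mean_age = sum(m * p for m, p in zip(midpoints, rel_freq))
var_age = sum((m - mean_age) ** 2 * p for m, p in zip(midpoints, rel_freq))
sd_age = math.sqrt(var_age)

print(round(mean_age, 2), round(var_age, 2), round(sd_age, 2))
```

Small differences in the last digit against the slide values are expected from rounding or truncation.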

The best-known density functions
• The uniform or rectangular density
• The triangular density
• The exponential density
• The Gamma and Chi-square densities
• The Beta density
• The Normal or Gaussian density
• The t and F densities

The rectangular distribution
[Figure: rectangular density of height 1/w on the interval from -w/2 to w/2]

The triangular distribution
[Figure: triangular density, horizontal axis marked at 0]

The Exponential distribution

The standard Normal distribution

The general Normal distribution
[Figure: Normal density with $\mu = 100$, $\sigma = 15$]

General Gaussian
Also known as the normal distribution or bell-shaped curve.
[Figure: Gaussian density with $\mu = 100$, $\sigma = 15$]
Shorthand: We say X ~ N($\mu$, $\sigma^2$) to mean “X is distributed as a Gaussian with parameters $\mu$ and $\sigma^2$”. In the above figure, X ~ N(100, $15^2$).

The Error Function Assume X ~ N(0, 1) Define ERF(x) = P(X<x) = Cumulative

Using The Error Function
Assume X ~ N($\mu$, $\sigma^2$)
P(X<x | $\mu$, $\sigma$
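The slides use ERF(x) for the cumulative distribution of a standard Normal. A hedged Python sketch of how such probabilities can be computed follows. Note that Python's `math.erf` is the conventional error function, which differs from the slides' ERF by a rescaling, so the helper names below are assumptions introduced for illustration.

```python
import math


def std_normal_cdf(x):
    """P(X < x) for X ~ N(0, 1); this is what the slides call ERF(x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def normal_cdf(x, mu, sigma):
    """P(X < x) for X ~ N(mu, sigma^2), obtained by standardizing x."""
    return std_normal_cdf((x - mu) / sigma)


# Example with the N(100, 15^2) curve from the earlier figure:
p_below_115 = normal_cdf(115.0, 100.0, 15.0)   # one standard deviation above the mean
p_between = normal_cdf(115.0, 100.0, 15.0) - normal_cdf(85.0, 100.0, 15.0)
print(p_below_115, p_between)
```

Standardizing first means a single function of one argument covers every Gaussian, which is the point of the "Using The Error Function" slide.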

The Central Limit Theorem
• If (X1, X2, …, Xn) are i.i.d. continuous random variables
• Then define their sample mean $z = \frac{1}{n}\sum_{i=1}^{n} x_i$
• As n → infinity, p(z) → Gaussian with mean E[Xi] and variance Var[Xi]/n
• This gives some justification for the common assumption of Gaussian noise
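A quick simulation can make the theorem concrete. The sketch below assumes the Xi are uniform on [0, 1] and takes z to be the sample mean, so theory predicts mean 0.5 and variance (1/12)/n; the sample size, repetition count and seed are arbitrary illustrative choices.

```python
import random


def sample_mean_of_uniforms(n, rng):
    """Average of n independent Uniform(0, 1) draws."""
    return sum(rng.random() for _ in range(n)) / n


rng = random.Random(42)
n = 30          # number of variables averaged in each z
reps = 20000    # number of independent z values simulated

z = [sample_mean_of_uniforms(n, rng) for _ in range(reps)]
z_mean = sum(z) / reps
z_var = sum((v - z_mean) ** 2 for v in z) / reps
z_sd = z_var ** 0.5

# For a Gaussian, about 68.27% of the mass lies within one standard deviation
frac_within_1sd = sum(1 for v in z if abs(v - z_mean) < z_sd) / reps

print(z_mean, z_var * n, frac_within_1sd)
```

Multiplying the simulated variance by n should land near Var[Xi] = 1/12, and the one-standard-deviation fraction should land near the Gaussian value even though each Xi is far from Gaussian.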

Density function estimators
• Histograms: h is the class width
• K-nearest neighbors: dk

> x=c(7.3, 6.8, 7.1, 2.5, 7.9, 6.5,

kNN density estimation at 20 points with k = 1, 3, 5, 7
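A minimal sketch of a one-dimensional k-nearest-neighbor density estimate, assuming the common form k / (2 n d_k(x)), where d_k(x) is the distance from x to its k-th nearest data point. The data are only the six values visible in the truncated R vector above, so the numbers are illustrative and will not reproduce the slide's plot.

```python
def knn_density(x, data, k):
    """k-NN density estimate at x: k points fall in a window of width 2*d_k.

    d_k is the distance to the k-th nearest observation. The estimate is
    undefined when d_k is 0 (for example k = 1 evaluated exactly at a data point).
    """
    distances = sorted(abs(x - xi) for xi in data)
    d_k = distances[k - 1]
    return k / (2.0 * len(data) * d_k)


# Visible prefix of the slide's data vector (the remainder is cut off)
data = [7.3, 6.8, 7.1, 2.5, 7.9, 6.5]

estimate_k1 = knn_density(7.0, data, k=1)
estimate_k3 = knn_density(7.0, data, k=3)
estimate_k5 = knn_density(7.0, data, k=5)
print(estimate_k1, estimate_k3, estimate_k5)
```

Small k gives a spiky estimate that reacts to individual points, while larger k widens the window and smooths the curve, which is the effect the k = 1, 3, 5, 7 comparison illustrates.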

Kernel estimation of a univariate density function
In the univariate case, the kernel estimator of the density function f(x) is obtained as follows. Let x1, …, xn be a sample of a random variable X with density function f(x), and define the empirical distribution function by
$F_n(x) = \frac{\#\{x_i \le x\}}{n}$
which is an estimator of the cumulative distribution function F(x) of X. Since the density function f(x) is the derivative of the distribution function F, approximating the derivative gives
$\hat{f}(x) = \frac{F_n(x+h) - F_n(x-h)}{2h}$

where h is a positive value close to zero. This is equivalent to the proportion of points in the interval (x-h, x+h) divided by 2h. The previous equation can be written as:
$\hat{f}(x) = \frac{1}{nh}\sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)$
where the weight function K is defined by
K(z) = 0 if |z| > 1
K(z) = 1/2 if |z| ≤ 1

Sample: 6, 8, 9, 12, 20, 25, 18, 31
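A small Python sketch applying the uniform-kernel estimator to the sample above. It computes the estimate both as a kernel sum and as "proportion of points within h of x, divided by 2h" to show that the two forms agree; the evaluation point x = 10 and the bandwidth h = 3 are illustrative assumptions.

```python
def uniform_kernel(z):
    """Weight function K(z): 1/2 on |z| <= 1 and 0 elsewhere."""
    return 0.5 if abs(z) <= 1 else 0.0


def kernel_density(x, data, h, kernel):
    """Kernel estimate (1 / (n h)) * sum_i K((x - x_i) / h)."""
    n = len(data)
    return sum(kernel((x - xi) / h) for xi in data) / (n * h)


def naive_density(x, data, h):
    """Proportion of points in (x - h, x + h), divided by 2h."""
    inside = sum(1 for xi in data if x - h < xi < x + h)
    return inside / (len(data) * 2.0 * h)


sample = [6, 8, 9, 12, 20, 25, 18, 31]

x, h = 10.0, 3.0   # illustrative choices; the points 8, 9 and 12 fall in (7, 13)
by_kernel = kernel_density(x, sample, h, uniform_kernel)
by_count = naive_density(x, sample, h)
print(by_kernel, by_count)
```

The two forms can differ only when a data point lies exactly on the boundary x ± h, which does not happen for this choice of x and h.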

This is called the uniform kernel, and h is called the bandwidth, a smoothing parameter that indicates how much each sample point contributes to the estimate at the point x. In general, K and h must satisfy certain regularity conditions, such as:
• K(z) must be bounded and absolutely integrable on $(-\infty, \infty)$
Usually, but not always, K(z) ≥ 0 and K is symmetric, so any symmetric density function can be used as a kernel.

Choice of the bandwidth h
where n is the number of data points and

THE GAUSSIAN KERNEL
In this case the kernel represents a smoother weight function

THE TRIANGULAR KERNEL
K(z) = 1 - |z| for |z| < 1, 0 otherwise.
THE KERNEL

THE EPANECHNIKOV KERNEL
K(z) = … for |z| < …, 0 otherwise

Density estimation at 20 points using a Gaussian kernel with h = 0.5, ”opt1”,
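The kernels named on the preceding slides can be compared side by side. The sketch below uses the standard Gaussian kernel, the triangular kernel as defined above, and, as an assumption, the common Epanechnikov form 0.75 (1 - z^2) on |z| < 1, since the slide's own expression for it is not legible. It checks numerically that each one integrates to 1, which is what makes the resulting estimate a density.

```python
import math


def gaussian_kernel(z):
    """Standard Normal density used as a weight function."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def triangular_kernel(z):
    """K(z) = 1 - |z| for |z| < 1, 0 otherwise."""
    return max(0.0, 1.0 - abs(z))


def epanechnikov_kernel(z):
    """Assumed common form: 0.75 * (1 - z^2) for |z| < 1, 0 otherwise."""
    return 0.75 * (1.0 - z * z) if abs(z) < 1 else 0.0


def integrate(kernel, lo=-8.0, hi=8.0, steps=16000):
    """Midpoint-rule approximation of the integral of kernel over [lo, hi]."""
    dz = (hi - lo) / steps
    return sum(kernel(lo + (i + 0.5) * dz) for i in range(steps)) * dz


def kernel_density(x, data, h, kernel):
    """Kernel estimate (1 / (n h)) * sum_i K((x - x_i) / h)."""
    return sum(kernel((x - xi) / h) for xi in data) / (len(data) * h)


kernels = {"gaussian": gaussian_kernel,
           "triangular": triangular_kernel,
           "epanechnikov": epanechnikov_kernel}
areas = {name: integrate(k) for name, k in kernels.items()}

sample = [6, 8, 9, 12, 20, 25, 18, 31]
estimates = {name: kernel_density(10.0, sample, 3.0, k)
             for name, k in kernels.items()}
print(areas)
print(estimates)
```

Swapping the kernel changes the smoothness of the estimate much more than its overall level; the bandwidth h is usually the more influential choice.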

Two-dimensional random variables
p(x, y) = probability density of random variables (X, Y) at

Two-dimensional density function estimators
• Histograms: A is the area of the class
• K-nearest neighbors: Ak is the area that includes up to the k-th neighbor
• Kernel density estimators

Bivariate kernel estimation
Let xi = (x, y) be the observed values and t = (t1, t

• $(a_1, a_2)H^{-1}$ where

Density estimation: bivariate Gaussian kernel

f1 = kde2d(autompg1$V1, autompg1$V5, n=100)
persp(f1$x, f

contour(f1, levels=c(8e-6, 2e-5, 2.8e-5), col=c(2, 3, 4), xlab="mpg", ylab="weight")

In 2 dimensions
Let X, Y be a pair

In 2 dimensions
Let X, Y be a pair of continuous random variables, and let R be some region of (X, Y) space…
P(20 < mpg < 30 and 2500 < weight < 3000) = volume under the 2-d surface within the red rectangle

In 2 dimensions
Let X, Y be a pair of continuous random variables, and let R be some region of (X, Y) space…
P( $[(\mathrm{mpg}-25)/10]^2 + [(\mathrm{weight}-3300)/1500]^2 < 1$ ) = volume under the 2-d surface within the red oval

In 2 dimensions
Let X, Y be a pair of continuous random variables, and let R be some region of (X, Y) space…
Take the special case of region R = “everywhere”. Remember that with probability 1, (X, Y) will be drawn from “somewhere”. So $\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} p(x, y)\,dy\,dx = 1$

In m dimensions
Let (X1, X2, …, Xm)

Independence
If X and Y are independent then knowing the value of X does not help predict the value of Y.
[Figure: contour plot] The contours say that acceleration and weight are independent.

Multivariate Expectation
E[mpg, weight] = (24.5, 2600)
The centroid of the cloud

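Multivariate Expectation
> library(MASS)   # provides kde2d
> f1 = kde2d(autompg1$mpg, autompg1$weight, n=100)
> dx = f1$x[2] - f1$x[1]
> dy = f1$y[2] - f1$y[1]
> dx
[1] 0.379798
> dy
[1] 35.62626
> # f1$z[i, j] is the density at (f1$x[i], f1$y[j]), so f1$x recycles correctly down the rows
> meanmpg = sum(f1$x * f1$z) * dx * dy
> meanmpg
[1] 22.48855
> # f1$y indexes the columns, so transpose the grid before multiplying
> # (the slide used sum(f1$y*f1$z)*dx*dy, which pairs f1$y with the mpg axis, and reported 2848.638)
> meanweight = sum(f1$y * t(f1$z)) * dx * dy
> # sample means, for comparison
> mean(autompg1$weight)
[1] 2977.584
> mean(autompg1$mpg)
[1] 23.44592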

Multivariate Expectation

Test your understanding
• All the time? Always
• Only when X and Y

Bivariate Expectation

Bivariate Covariance

Estimated covariance and standard deviations for mpg and weight
> cov(autompg1[, c(1, 5)])
               mpg      weight
mpg       60.91814   -5517.441
weight -5517.44070  721484.709
> sd(autompg1$mpg)
[1] 7.805007
> sd(autompg1$weight)
[1] 849.4026
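As a consistency check on this output, the diagonal of the covariance matrix should equal the squared standard deviations, and dividing the covariance by the product of the standard deviations gives the correlation coefficient. The correlation is not printed on the slide, and $\rho$ is an assumed symbol for it:

```latex
\sigma_{mpg}^2 = 7.805007^2 \approx 60.918,
\qquad
\sigma_{weight}^2 = 849.4026^2 \approx 721485
\\
\rho = \frac{\sigma_{mpg,weight}}{\sigma_{mpg}\,\sigma_{weight}}
     = \frac{-5517.441}{7.805007 \times 849.4026} \approx -0.832
```

A value this close to -1 says that heavier cars tend strongly, though not deterministically, to have lower mpg.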

Covariance Intuition
E[mpg, weight] = (24.5, 2600)

Covariance Intuition
Principal eigenvector of $\Sigma$
E[mpg, weight] = (24.5, 2600)

Regression Line
Notice that the regression line passes through $(\bar{x}, \bar{y})$

Regression Line
> l1 = lm(weight ~ mpg, data=autompg1)
> l1

Call:
lm(formula = weight ~ mpg, data = autompg1)

Coefficients:
(Intercept)          mpg
    5101.11       -90.57

> # slope of regression line: cov(mpg, weight) / var(mpg)
> slope = -5517.44/60.918
> slope
[1] -90.571
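The fitted coefficients can be cross-checked by hand from the covariance matrix and the sample means reported on earlier slides. This is a sketch of the arithmetic for simple least squares, with $\hat\beta_0$ and $\hat\beta_1$ as assumed names for the intercept and slope, x = mpg and y = weight:

```latex
\hat\beta_1 = \frac{\sigma_{xy}}{\sigma_x^2} = \frac{-5517.44}{60.918} \approx -90.57
\\
\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x} \approx 2977.584 + 90.571 \times 23.44592 \approx 5101.1
```

The second line is exactly the statement that the regression line passes through the point of means.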

First principal component
> a = cov(autompg1[, c(1, 5)])
> eigen(a)
$values
[1] 721526.90386     18.72329

$vectors
             [,1]        [,2]
[1,] -0.007647317 0.999970759
[2,]  0.999970759 0.007647317

> # slope of first principal component
> .99997/-.00764
[1] -130.8861
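For a 2 x 2 covariance matrix the eigen-decomposition has a closed form, which gives an independent check of the R output. The sketch below is written in Python; the function name is an illustrative assumption, and the matrix entries are the covariance values printed on the earlier slide.

```python
import math


def eigen_sym_2x2(a, b, c):
    """Eigenvalues (largest first) and unit principal eigenvector of [[a, b], [b, c]].

    Assumes b != 0, so that (b, lam1 - a) is a valid non-zero eigenvector.
    """
    mid = 0.5 * (a + c)
    rad = math.hypot(0.5 * (a - c), b)
    lam1, lam2 = mid + rad, mid - rad
    vx, vy = b, lam1 - a            # solves (A - lam1 * I) v = 0
    norm = math.hypot(vx, vy)
    return lam1, lam2, (vx / norm, vy / norm)


# Covariance of (mpg, weight) as printed by cov(...) on the slide
var_mpg, cov_mw, var_weight = 60.91814, -5517.44070, 721484.709

lam1, lam2, principal = eigen_sym_2x2(var_mpg, cov_mw, var_weight)
pc_slope = principal[1] / principal[0]          # weight change per unit mpg along PC1
regression_slope = cov_mw / var_mpg             # least-squares slope, for contrast

print(lam1, lam2)
print(principal)
print(pc_slope, regression_slope)
```

With unrounded eigenvector entries the slope comes out slightly different from the slide's -130.8861, which was computed from values rounded to five decimals. Either way the principal axis is steeper than the regression line, because the principal component minimizes perpendicular distances rather than vertical ones.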

Covariance Fun Facts
• True or False: If $\sigma_{xy}$ = 0 then X and Y are independent. False
• True or False: If X and Y are independent then $\sigma_{xy}$ = 0. True
• True or False: If $\sigma_{xy} = \sigma_x \sigma_y$ then X and Y are deterministically related. True
• True or False: If X and Y are deterministically related then $\sigma_{xy} = \sigma_x \sigma_y$. False
How could you prove or disprove these?
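One way to disprove the first and last statements is a single counterexample. The sketch below uses an illustrative discrete case that is not from the slides (X uniform on {-1, 0, 1} and Y = X^2), with exact fractions so that no rounding is involved.

```python
from fractions import Fraction

xs = [-1, 0, 1]
prob = Fraction(1, 3)          # X is uniform on {-1, 0, 1}


def f(x):
    """Deterministic relation Y = f(X) = X^2."""
    return x * x


def expect(g):
    """Expectation of g(X) under the uniform distribution on xs."""
    return sum(prob * g(x) for x in xs)


mean_x = expect(lambda x: x)
mean_y = expect(f)
cov_xy = expect(lambda x: (x - mean_x) * (f(x) - mean_y))
var_x = expect(lambda x: (x - mean_x) ** 2)
var_y = expect(lambda x: (f(x) - mean_y) ** 2)

# Independence would require P(X=0, Y=0) == P(X=0) * P(Y=0)
p_joint = sum(prob for x in xs if x == 0 and f(x) == 0)
p_x0 = sum(prob for x in xs if x == 0)
p_y0 = sum(prob for x in xs if f(x) == 0)

print(cov_xy, var_x, var_y)
print(p_joint, p_x0 * p_y0)
```

Here the covariance is zero although Y is a function of X, so zero covariance does not imply independence, and a deterministic relation does not force the covariance to equal the product of the standard deviations. Covariance only measures the linear part of a relationship.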

Test your understanding
• All the time?
• Only when X and Y are

Marginal Distributions

Conditional Distributions

Conditional Distributions
Why?
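The standard definitions behind these two slide titles, written as a sketch in the p(x, y) notation used for two-dimensional densities earlier:

```latex
p(x) = \int_{-\infty}^{\infty} p(x, y)\,dy,
\qquad
p(x \mid y) = \frac{p(x, y)}{p(y)} \quad \text{for } p(y) > 0
```

Dividing by p(y) rescales the slice of the joint density at Y = y so that it integrates to 1 over x, which is what makes the conditional a proper density.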

Independence Revisited
It’s easy to prove that these statements are equivalent…

More useful stuff
(These can all be proved from definitions on previous slides)
Bayes

Mixing discrete and continuous variables
Bayes Rule
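With a discrete class and a continuous attribute, Bayes Rule combines class priors (probabilities) with class-conditional densities. The sketch below assumes Gaussian class-conditionals; the class names, priors and parameters are hypothetical numbers chosen only for illustration and are not estimated from the slides' data.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def posterior(x, priors, params):
    """P(class | x) = P(class) p(x | class) / sum over classes of the same product."""
    joint = {c: priors[c] * gaussian_pdf(x, *params[c]) for c in priors}
    evidence = sum(joint.values())           # p(x), the marginal density of x
    return {c: j / evidence for c, j in joint.items()}


# Hypothetical model: years of education given wealth class
priors = {"poor": 0.75, "wealthy": 0.25}
params = {"poor": (12.0, 2.0), "wealthy": (16.0, 2.0)}   # (mean, std. dev.)

post_at_14 = posterior(14.0, priors, params)   # equally far from both means
post_at_18 = posterior(18.0, priors, params)
print(post_at_14)
print(post_at_18)
```

At 14 years the two class-conditional densities are equal, so the posterior simply returns the priors; further from that midpoint the density ratio takes over and can outweigh the prior.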

P(education, salary > 50k)

Estimation of the posterior P(Class | Education)

Mixing discrete and continuous variables
P(EduYears, Wealthy)

Mixing discrete and continuous variables
P(EduYears, Wealthy)
P(Wealthy | EduYears)

Mixing discrete and continuous variables
P(EduYears, Wealthy)
P(Wealthy | EduYears)
Renormalized Axes
P(Edu.

Exercises
• Suppose X and Y are independent real-valued random variables distributed between 0 and 1:
• What is p[min(X, Y)]?
• What is E[min(X, Y)]?
• Prove that E[X] is the value u that minimizes $E[(X-u)^2]$
• What is the value u that minimizes E[|X-u|]?