Conditional Probability Distributions
Eran Segal, Weizmann Institute
Last Time
- Local Markov assumptions – basic BN independencies
- d-separation – all independencies via graph structure
- G is an I-Map of P if and only if P factorizes over G
- I-equivalence – graphs with identical independencies
- Minimal I-Map
  - All distributions have I-Maps (sometimes more than one)
  - A minimal I-Map does not capture all independencies in P
- Perfect Map – not every distribution P has one
- PDAGs
  - Compact representation of I-equivalent graphs
  - Algorithm for finding PDAGs
CPDs
- Thus far we ignored the representation of CPDs
- Today we will cover the range of CPD representations
  - Discrete
  - Continuous
  - Sparse
  - Deterministic
  - Linear
Table CPDs
- Entry for each joint assignment of X and Pa(X)
- For each pa_X: sum_x P(X=x | pa_X) = 1
- Most general representation: represents every discrete CPD
- Limitations
  - Cannot model continuous RVs
  - Number of parameters exponential in |Pa(X)|
  - Cannot model large in-degree dependencies
  - Ignores structure within the CPD

Example tables:
  P(I):     i0 = 0.7,  i1 = 0.3
  P(S | I): i0 -> (s0 = 0.95, s1 = 0.05),  i1 -> (s0 = 0.2, s1 = 0.8)
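A minimal sketch (not from the slides; variable and value names are illustrative) of a table CPD stored as a dictionary, showing why the parameter count grows exponentially with the number of parents:

```python
# Table CPD P(S | I) from the slide, stored as {(parent value, child value): probability}.
cpd_s_given_i = {
    ("i0", "s0"): 0.95, ("i0", "s1"): 0.05,
    ("i1", "s0"): 0.20, ("i1", "s1"): 0.80,
}

def num_parameters(num_parents, values_per_var=2):
    """Independent parameters of a full table CPD over discrete variables:
    one free probability per parent assignment and child value, minus normalization."""
    return (values_per_var ** num_parents) * (values_per_var - 1)

for k in range(1, 6):
    print(k, "parents ->", num_parameters(k), "parameters")  # grows as 2^k for binary variables
```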
Structured CPDs
- Key idea: reduce parameters by modeling P(X | Pa(X)) without explicitly modeling all entries of the joint assignment table
- Lose expressive power (cannot represent every CPD)
Deterministic CPDs
- There is a function f: Val(Pa(X)) -> Val(X) such that
    P(x | pa_X) = 1 if x = f(pa_X), and 0 otherwise
- Examples
  - OR, AND, NAND functions
  - Z = Y + X (continuous variables)
Deterministic CPDs
- Replace spurious dependencies with deterministic CPDs
- Need to make sure that the deterministic CPD is compactly stored

Full table P(S | T1, T2):
  (t0, t0) -> (s0 = 0.95, s1 = 0.05)
  (t0, t1) -> (s0 = 0.2,  s1 = 0.8)
  (t1, t0) -> (s0 = 0.2,  s1 = 0.8)
  (t1, t1) -> (s0 = 0.2,  s1 = 0.8)

With an intermediate deterministic node T = OR(T1, T2):
  P(T | T1, T2): (t0, t0) -> t0;  (t0, t1), (t1, t0), (t1, t1) -> t1
  P(S | T):      t0 -> (s0 = 0.95, s1 = 0.05),  t1 -> (s0 = 0.2, s1 = 0.8)
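A small sketch (helper names assumed, not from the slides) showing that the deterministic node T = OR(T1, T2) lets the full table P(S | T1, T2) be recovered from the smaller table P(S | T):

```python
# Deterministic intermediate node: T = OR(T1, T2).
def t_or(t1, t2):
    return int(t1 or t2)

# Compact CPD P(S | T) from the slide (T values 0/1 stand for t0/t1).
p_s_given_t = {0: {"s0": 0.95, "s1": 0.05},
               1: {"s0": 0.20, "s1": 0.80}}

# The full CPD P(S | T1, T2) is recovered by composing the two.
def p_s_given_t1_t2(s, t1, t2):
    return p_s_given_t[t_or(t1, t2)][s]

for t1 in (0, 1):
    for t2 in (0, 1):
        print((t1, t2), p_s_given_t1_t2("s0", t1, t2))  # matches the 4-row table above
```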
Deterministic CPDs
- Induce additional conditional independencies
- Example: network T1, T2 -> T -> S1, S2 where T is any deterministic function of T1, T2
  => Ind(S1; S2 | T1, T2)
Deterministic CPDs
- Induce additional conditional independencies
- Example: network over A, B, C, D, E where C is an XOR deterministic function of A, B
  => Ind(D; E | B, C)
Deterministic CPDs
- Induce additional conditional independencies
- Example: network T1, T2 -> T -> S1, S2 where T is an OR deterministic function of T1, T2
  => Ind(S1; S2 | T1 = t1)
- Context specific independencies
Context Specific Independencies
- Let X, Y, Z be pairwise disjoint RV sets
- Let C be a set of variables and c in Val(C)
- X and Y are contextually independent given Z and c, denoted (X _|_c Y | Z, c), if:
    P(X | Y, Z, c) = P(X | Z, c) whenever P(Y, Z, c) > 0
Tree CPDs
Full table CPD P(D | A, B, C):
  a0 b0 c0 -> (d0 = 0.2, d1 = 0.8)
  a0 b0 c1 -> (d0 = 0.2, d1 = 0.8)
  a0 b1 c0 -> (d0 = 0.2, d1 = 0.8)
  a0 b1 c1 -> (d0 = 0.2, d1 = 0.8)
  a1 b0 c0 -> (d0 = 0.9, d1 = 0.1)
  a1 b0 c1 -> (d0 = 0.7, d1 = 0.3)
  a1 b1 c0 -> (d0 = 0.4, d1 = 0.6)
  a1 b1 c1 -> (d0 = 0.4, d1 = 0.6)
  8 parameters

Equivalent tree CPD:
  A
  - a0 -> (0.2, 0.8)
  - a1 -> B
          - b0 -> C
                  - c0 -> (0.9, 0.1)
                  - c1 -> (0.7, 0.3)
          - b1 -> (0.4, 0.6)
  4 parameters
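A compact sketch (function name invented, not from the slides) of the tree CPD above as nested conditionals; it stores only 4 distributions instead of the 8 rows of the full table:

```python
# Tree CPD P(D | A, B, C) from the slide: split on A, then B, then C.
def p_d_given_abc(a, b, c):
    """Return (P(d0), P(d1)) by walking the decision tree."""
    if a == "a0":
        return (0.2, 0.8)            # D ignores B and C in this context
    if b == "b1":
        return (0.4, 0.6)            # D ignores C in this context
    return (0.9, 0.1) if c == "c0" else (0.7, 0.3)

print(p_d_given_abc("a0", "b1", "c1"))  # (0.2, 0.8), same as the corresponding table row
```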
Gene Regulation: Simple Example
[Figure: a regulated gene controlled by an activator and a repressor; the regulators' expression, measured on a DNA microarray, defines three states of the regulated gene]
Regulation Tree (Segal et al., Nature Genetics '03)
[Figure: a regulation program as a tree over the module genes — split first on activator expression (false/true), then on repressor expression, giving States 1-3]
Respiration Module (Segal et al., Nature Genetics '03)
[Figure: regulation program and module genes — are the module genes known targets of the predicted regulators? Hap4 + Msn4 are known to regulate the module genes]
Rule CPDs
- A rule r is a pair (c; p) where c is an assignment to a subset of variables C and p in [0, 1]. Let Scope[r] = C
- A rule-based CPD P(X | Pa(X)) is a set of rules R s.t.
  - For each rule r in R: Scope[r] is a subset of {X} union Pa(X)
  - For each assignment (x, u) to {X} union Pa(X) there is exactly one rule (c; p) in R such that c is compatible with (x, u). Then P(X=x | Pa(X)=u) = p
Rule CPDs
- Example
  - Let X be a variable with Pa(X) = {A, B, C}
  - r1: (a1, b1, x0; 0.1)
  - r2: (a0, c1, x0; 0.2)
  - r3: (b0, c0, x0; 0.3)
  - r4: (a1, b0, c1, x0; 0.4)
  - r5: (a0, b1, c0; 0.5)
  - r6: (a1, b1, x1; 0.9)
  - r7: (a0, c1, x1; 0.8)
  - r8: (b0, c0, x1; 0.7)
  - r9: (a1, b0, c1, x1; 0.6)
- Note: each assignment maps to exactly one rule (see the sketch below)
- Rules cannot always be represented compactly within tree CPDs
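A rough sketch (not from the slides) of rule-CPD lookup: each rule is stored as a partial assignment plus a probability, and a query scans for the unique compatible rule. The rule contents follow the example above, with 0/1 for the values of A, B, C, X:

```python
# Each rule: (partial assignment over {A, B, C, X}, probability).
rules = [
    ({"A": 1, "B": 1, "X": 0}, 0.1),          # r1
    ({"A": 0, "C": 1, "X": 0}, 0.2),          # r2
    ({"B": 0, "C": 0, "X": 0}, 0.3),          # r3
    ({"A": 1, "B": 0, "C": 1, "X": 0}, 0.4),  # r4
    ({"A": 0, "B": 1, "C": 0}, 0.5),          # r5: covers both x0 and x1
    ({"A": 1, "B": 1, "X": 1}, 0.9),          # r6
    ({"A": 0, "C": 1, "X": 1}, 0.8),          # r7
    ({"B": 0, "C": 0, "X": 1}, 0.7),          # r8
    ({"A": 1, "B": 0, "C": 1, "X": 1}, 0.6),  # r9
]

def p_x_given_parents(x, a, b, c):
    """Return P(X=x | A=a, B=b, C=c): the probability of the unique compatible rule."""
    full = {"A": a, "B": b, "C": c, "X": x}
    matches = [p for cond, p in rules
               if all(full[var] == val for var, val in cond.items())]
    assert len(matches) == 1, "a rule CPD must have exactly one compatible rule per assignment"
    return matches[0]

print(p_x_given_parents(0, 1, 0, 1))  # 0.4, from rule r4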
Tree CPDs and Rule CPDs
- Can represent every discrete function
- Can be easily learned and dealt with in inference
- But some functions are not represented compactly
  - XOR in tree CPDs: cannot split in one step on (a0, b1) and (a1, b0)
- Alternative representations exist
  - Complex logical rules
Context Specific Independencies
Tree CPD for D with parents A, B, C: split on A; if A = a0, split on C (c0 -> (0.9, 0.1), c1 -> (0.7, 0.3)); if A = a1, split on B (b0 -> (0.2, 0.8), b1 -> (0.4, 0.6))
- Given A = a1, D depends only on B: Ind(D; C | A = a1)
- Given A = a0, D depends only on C: Ind(D; B | A = a0)
- Reasoning by cases implies that Ind(B; C | A, D)
Independence of Causal Influence
- Causes: X1, ..., Xn; Effect: Y (network X1, ..., Xn -> Y)
- General case: Y has a complex dependency on X1, ..., Xn
- Common case
  - Each Xi influences Y separately
  - The influences of X1, ..., Xn are combined into an overall influence on Y
Example 1: Noisy OR
- Two independent causes X1, X2
- Y = y1 cannot happen unless at least one of X1, X2 occurs
- The failure probabilities multiply:
    P(Y=y0 | x1^1, x2^1) = P(Y=y0 | x1^1, x2^0) * P(Y=y0 | x1^0, x2^1) = 0.1 * 0.2 = 0.02

CPD P(Y | X1, X2):
  (x1^0, x2^0) -> (y0 = 1,    y1 = 0)
  (x1^0, x2^1) -> (y0 = 0.2,  y1 = 0.8)
  (x1^1, x2^0) -> (y0 = 0.1,  y1 = 0.9)
  (x1^1, x2^1) -> (y0 = 0.02, y1 = 0.98)
Noisy OR: Elaborate Representation
Introduce noisy versions X'1, X'2 of the causes, and let Y be a deterministic OR of X'1, X'2:

Noise CPD P(X'1 | X1), noise parameter lambda1 = 0.9:
  x1^0 -> (x'1^0 = 1,   x'1^1 = 0)
  x1^1 -> (x'1^0 = 0.1, x'1^1 = 0.9)

Noise CPD P(X'2 | X2), noise parameter lambda2 = 0.8:
  x2^0 -> (x'2^0 = 1,   x'2^1 = 0)
  x2^1 -> (x'2^0 = 0.2, x'2^1 = 0.8)

Deterministic OR P(Y | X'1, X'2):
  (x'1^0, x'2^0) -> (y0 = 1, y1 = 0)
  all other assignments -> (y0 = 0, y1 = 1)
Noisy OR: Elaborate Representation Decomposition results in the same distribution
Noisy OR: General Case
- Y is a binary variable with k binary parents X1, ..., Xk
- The CPD P(Y | X1, ..., Xk) is a noisy OR if there are k+1 noise parameters lambda0, lambda1, ..., lambdak such that
    P(Y = y0 | X1, ..., Xk) = (1 - lambda0) * prod_{i : Xi = x_i^1} (1 - lambdai)
    P(Y = y1 | X1, ..., Xk) = 1 - P(Y = y0 | X1, ..., Xk)
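A short sketch (not from the slides) of the general noisy-OR CPD; the leak parameter lambda0 and per-cause parameters are placeholders chosen to match the two-cause example above (lambda0 = 0, lambda1 = 0.9, lambda2 = 0.8):

```python
def noisy_or(active, lambdas, lambda0=0.0):
    """P(Y = y1 | parents), where active[i] is 1 if cause i is present.

    Each active cause i independently fails to trigger Y with probability
    1 - lambdas[i]; the leak lambda0 lets Y turn on with no active cause.
    """
    p_y0 = 1.0 - lambda0
    for xi, lam in zip(active, lambdas):
        if xi == 1:
            p_y0 *= (1.0 - lam)
    return 1.0 - p_y0

# Reproduces the two-parent table: P(y1 | x1=1, x2=1) = 1 - 0.1 * 0.2 = 0.98
print(noisy_or([1, 1], [0.9, 0.8]))
print(noisy_or([0, 1], [0.9, 0.8]))  # 0.8
```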
Noisy OR Independencies
Network: X1, ..., Xn -> X'1, ..., X'n -> Y (each X'i is a noisy copy of Xi)
- For all i != j: Ind(Xi ; Xj | Y = y0)
Generalized Linear Models
- Model is a soft version of a linear threshold function
- Example: logistic function
  - Binary variables X1, ..., Xn, Y
  - P(Y = y1 | X1, ..., Xn) = sigmoid(w0 + sum_i wi*Xi) = 1 / (1 + exp(-(w0 + sum_i wi*Xi)))
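A minimal sketch (weights are made-up placeholders) of a logistic CPD over binary parents:

```python
import math

def logistic_cpd(x, weights, w0=0.0):
    """P(Y = 1 | X1..Xn = x) for a logistic CPD with one weight per parent."""
    z = w0 + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

# Three binary parents with assumed weights; probability rises as more parents are active.
print(logistic_cpd([0, 0, 0], [1.5, 2.0, 0.5], w0=-2.0))
print(logistic_cpd([1, 1, 0], [1.5, 2.0, 0.5], w0=-2.0))
```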
General Formulation
- Let Y be a random variable with parents X1, ..., Xn
- The CPD P(Y | X1, ..., Xn) exhibits independence of causal influence (ICI) if it can be described by the structure X1, ..., Xn -> Z1, ..., Zn -> Z -> Y, where the CPD P(Z | Z1, ..., Zn) is deterministic
  - Logistic: Zi = wi * 1(Xi = 1); Z = sum_i Zi; Y is a logistic (soft threshold) function of Z
  - Noisy OR: Zi is Xi passed through its noise model; Z is an OR of the Zi; Z -> Y is the identity CPD
General Formulation
- Key advantage: O(n) parameters
- As stated, not all that useful, since any complex CPD can be represented through a complex deterministic CPD
Continuous Variables
- One solution: discretize
  - Often requires too many value states
  - Loses domain structure
- Other solution: use a continuous function for P(X | Pa(X))
  - Can combine continuous and discrete variables, resulting in hybrid networks
  - Inference and learning may become more difficult
Gaussian Density Functions
- Among the most common continuous representations
- Univariate case:
    p(x) = (1 / (sqrt(2*pi) * sigma)) * exp(-(x - mu)^2 / (2*sigma^2))
[Plot: univariate Gaussian density, x from -4 to 4, peak of about 0.4 at the mean]
Gaussian Density Functions
- A multivariate Gaussian distribution over X1, ..., Xn has
  - Mean vector mu, with mu_i = E[Xi]
  - n x n positive definite covariance matrix Sigma, with Sigma_ii = Var[Xi] and Sigma_ij = Cov[Xi, Xj] = E[Xi*Xj] - E[Xi]E[Xj] (i != j)
  - Positive definite: x^T Sigma x > 0 for all x != 0
- Joint density function:
    p(x) = (2*pi)^(-n/2) * |Sigma|^(-1/2) * exp(-1/2 (x - mu)^T Sigma^{-1} (x - mu))
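A small numerical sketch (values are arbitrary) evaluating the multivariate Gaussian density with numpy, just to make the formula above concrete:

```python
import numpy as np

def gaussian_density(x, mu, sigma):
    """Multivariate Gaussian density N(mu; sigma) evaluated at point x."""
    n = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (-n / 2) * np.linalg.det(sigma) ** (-0.5)
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff)

mu = np.array([0.0, 1.0])
sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])          # positive definite covariance matrix
print(gaussian_density(np.array([0.5, 0.5]), mu, sigma))
```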
Gaussian Density Functions
- Marginal distributions are easy to compute
- Independencies can be determined from the parameters
  - If X = X1, ..., Xn have a joint normal distribution N(mu; Sigma), then Ind(Xi ; Xj) iff Sigma_ij = 0
  - Does not hold in general for non-Gaussian distributions
Linear Gaussian CPDs
- Y is a continuous variable with parents X1, ..., Xn
- Y has a linear Gaussian model if it can be described using parameters beta0, ..., betan and sigma^2 such that
    P(Y | x1, ..., xn) = N(beta0 + beta1*x1 + ... + betan*xn ; sigma^2)
  - Vector notation: P(Y | x) = N(beta0 + beta^T x ; sigma^2)
- Pros
  - Simple
  - Captures many interesting dependencies
- Cons
  - Fixed variance (variance cannot depend on parents' values)
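A brief sketch (coefficients are placeholders) of sampling from a linear Gaussian CPD, which makes the "fixed variance" restriction visible: the noise scale does not depend on the parent values.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_linear_gaussian(x, beta0, beta, sigma):
    """Draw Y ~ N(beta0 + beta^T x, sigma^2): the mean is linear in the parents,
    the variance is the same for every parent assignment."""
    mean = beta0 + np.dot(beta, x)
    return rng.normal(mean, sigma)

beta0, beta, sigma = 1.0, np.array([2.0, -0.5]), 0.3
print(sample_linear_gaussian(np.array([0.2, 1.0]), beta0, beta, sigma))
print(sample_linear_gaussian(np.array([3.0, -1.0]), beta0, beta, sigma))
```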
Linear Gaussian Bayesian Network
- A linear Gaussian Bayesian network is a Bayesian network where
  - All variables are continuous
  - All of the CPDs are linear Gaussian
- Key result: linear Gaussian models are equivalent to multivariate Gaussian density functions
Equivalence Theorem
- Y is a linear Gaussian of its parents X1, ..., Xn:
    P(Y | x) = N(beta0 + beta^T x ; sigma^2)
- Assume that X1, ..., Xn are jointly Gaussian with N(mu; Sigma)
- Then:
  - The marginal distribution of Y is Gaussian with N(mu_Y; sigma_Y^2), where
      mu_Y = beta0 + beta^T mu    and    sigma_Y^2 = sigma^2 + beta^T Sigma beta
  - The joint distribution over {X, Y} is Gaussian, with
      Cov[Xi; Y] = sum_j betaj * Sigma_ij
=> Linear Gaussian BNs define a joint Gaussian distribution
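A quick numerical check (all numbers arbitrary, not from the slides) of the moment formulas above, comparing the closed form against Monte Carlo samples:

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([1.0, -2.0])
Sigma = np.array([[1.0, 0.4],
                  [0.4, 2.0]])
beta0, beta, sigma = 0.5, np.array([2.0, -1.0]), 0.3

# Closed form from the equivalence theorem.
mu_y = beta0 + beta @ mu
var_y = sigma**2 + beta @ Sigma @ beta

# Monte Carlo estimate: sample X jointly Gaussian, then Y | X as a linear Gaussian.
X = rng.multivariate_normal(mu, Sigma, size=200_000)
Y = beta0 + X @ beta + rng.normal(0.0, sigma, size=len(X))
print(mu_y, Y.mean())    # the two numbers should agree closely
print(var_y, Y.var())
```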
Converse Equivalence Theorem
- If {X, Y} have a joint Gaussian distribution, then P(Y | X) is a linear Gaussian CPD
- Implications of equivalence
  - Joint distribution has a compact representation: O(n^2) parameters
  - We can easily transform back and forth between Gaussian distributions and linear Gaussian Bayesian networks
  - The representations may differ in the number of parameters
    - Example: chain network X1 -> X2 -> ... -> Xn
      - The Gaussian distribution has a full covariance matrix (O(n^2) parameters)
      - The linear Gaussian network needs only O(n) parameters
Hybrid Models
- Models of continuous and discrete variables
  - Continuous variables with discrete parents
  - Discrete variables with continuous parents
- Conditional Linear Gaussians (CLG)
  - Y continuous variable
  - X = {X1, ..., Xn} continuous parents
  - U = {U1, ..., Um} discrete parents
  - For every assignment u to U:
      P(Y | u, x) = N(beta_{u,0} + sum_i beta_{u,i} * xi ; sigma_u^2)
- A Conditional Linear Gaussian Bayesian network is one where
  - Discrete variables have only discrete parents
  - Continuous variables have only CLG CPDs
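A compact sketch (parameter values invented) of a CLG CPD: one linear Gaussian per assignment of the discrete parents.

```python
import numpy as np

rng = np.random.default_rng(2)

# One (beta0, beta, sigma) triple per discrete-parent assignment u.
clg_params = {
    "u0": (0.0, np.array([1.0, 0.5]), 0.2),
    "u1": (3.0, np.array([-2.0, 1.0]), 1.5),
}

def sample_clg(u, x):
    """Sample Y ~ N(beta_{u,0} + beta_u^T x, sigma_u^2) for discrete context u."""
    beta0, beta, sigma = clg_params[u]
    return rng.normal(beta0 + beta @ x, sigma)

print(sample_clg("u0", np.array([1.0, 2.0])))
print(sample_clg("u1", np.array([1.0, 2.0])))
```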
Hybrid Models
- Continuous parents for discrete children
  - Threshold models
  - Linear sigmoid
Summary: CPD Models
- Deterministic functions
- Context specific dependencies
- Independence of causal influence
  - Noisy OR
  - Logistic function
- CPDs capture additional domain structure