Introduction to Probabilistic Graphical Models
Eran Segal, Weizmann Institute
Logistics
- Staff:
  - Instructor: Eran Segal (eran.segal@weizmann.ac.il, room 149)
  - Teaching Assistant: Ohad Manor (ohad.manor@weizmann.ac.il, room 125)
- Course information: http://www.weizmann.ac.il/math/pgm
- Course book: "Bayesian Networks and Beyond", Daphne Koller (Stanford) & Nir Friedman (Hebrew U.)
Course structure
- One weekly meeting: Sun, 9am-11am
- Homework assignments: 2 weeks to complete each; 40% of final grade
- Final exam: 3-hour class exam, date to be announced; 60% of final grade
Probabilistic Graphical Models
- Tool for representing complex systems and performing sophisticated reasoning tasks
- Fundamental notion: modularity
  - Complex systems are built by combining simpler parts
- Why have a model?
  - Compact and modular representation of complex systems
  - Ability to execute complex reasoning patterns
  - Make predictions
  - Generalize from the particular problem
Probabilistic Graphical Models
- Increasingly important in machine learning
- Many classical probabilistic problems in statistics, information theory, pattern recognition, and statistical mechanics are special cases of the formalism
  - Graphical models provide a common framework
  - Advantage: specialized techniques developed in one field can be transferred between research communities
Representation: Graphs
- Intuitive data structure for modeling highly interacting sets of variables
- Explicit model of modularity
- Data structure that allows for the design of efficient general-purpose algorithms
Reasoning: Probability Theory
- Well understood framework for modeling uncertainty:
  - Partial knowledge of the state of the world
  - Noisy observations
  - Phenomena not covered by our model
  - Inherent stochasticity
- Clear semantics
- Can be learned from data
Probabilistic Reasoning
- In this course we will learn:
  - Semantics of probabilistic graphical models (PGMs)
    - Bayesian networks
    - Markov networks
  - Answering queries in a PGM ("inference")
  - Learning PGMs from data ("learning")
  - Modeling temporal processes with PGMs
    - Hidden Markov Models (HMMs) as a special case
Course Outline
- Weeks 1-2: Introduction, Bayesian network representation (reading: 1-3)
- Week 3: Local probability models (reading: 5)
- Week 4: Undirected graphical models (reading: 4)
- Weeks 5-6: Exact inference (reading: 9, 10)
- Weeks 7-8: Approximate inference (reading: 12)
- Weeks 9-10: Learning: parameters (reading: 16, 17)
- Week 11: Learning: structure (reading: 18)
- Week 12: Partially observed data (reading: 19)
- Week 13: Learning undirected graphical models (reading: 20)
- Week 14: Template models (reading: 6)
- Week 15: Dynamic Bayesian networks (reading: 15)
A Simple Example
- We want to model whether our neighbor will inform us of the alarm being set off
- The alarm can be set off if:
  - There is a burglary
  - There is an earthquake
- Whether our neighbor calls depends on whether the alarm is set off
A Simple Example
- Variables: Earthquake (E), Burglary (B), Alarm (A), NeighborCalls (N)
- Full joint distribution P(E, B, A, N):

  E B A N | Prob.
  F F F F | 0.01
  F F F T | 0.04
  F F T F | 0.05
  F F T T | 0.01
  F T F F | 0.02
  F T F T | 0.07
  F T T F | 0.2
  F T T T | 0.1
  T F F F | 0.01
  T F F T | 0.07
  T F T F | 0.13
  T F T T | 0.04
  T T F F | 0.06
  T T F T | 0.05
  T T T F | 0.1
  T T T T | 0.05

- 2^4 - 1 = 15 independent parameters
A Simple Example
- Network structure: Earthquake → Alarm ← Burglary, Alarm → NeighborCalls
- P(B): B=F 0.9, B=T 0.1
- P(E): E=F 0.7, E=T 0.3
- P(A | E, B):

  E B | A=F  A=T
  F F | 0.99 0.01
  F T | 0.1  0.9
  T F | 0.3  0.7
  T T | 0.01 0.99

- P(N | A):

  A | N=F N=T
  F | 0.9 0.1
  T | 0.2 0.8

- 8 independent parameters (1 + 1 + 4 + 2), versus 15 for the full joint
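As a quick sanity check, here is a minimal sketch (plain Python; the numbers are taken from the CPD tables as reconstructed above, and the True/False encoding is ours) showing that the factored network defines the full joint distribution over all 16 assignments:

```python
import itertools

P_B = {True: 0.1, False: 0.9}       # P(Burglary)
P_E = {True: 0.3, False: 0.7}       # P(Earthquake)
P_A_true = {                        # P(Alarm = T | E, B)
    (False, False): 0.01, (False, True): 0.9,
    (True, False): 0.7,   (True, True): 0.99,
}
P_N_true = {False: 0.1, True: 0.8}  # P(NeighborCalls = T | A)

def joint(e, b, a, n):
    """P(E=e, B=b, A=a, N=n) from the factorization P(E)P(B)P(A|E,B)P(N|A)."""
    p_a = P_A_true[(e, b)] if a else 1 - P_A_true[(e, b)]
    p_n = P_N_true[a] if n else 1 - P_N_true[a]
    return P_E[e] * P_B[b] * p_a * p_n

# 8 CPD parameters implicitly define all 2^4 joint probabilities.
total = sum(joint(*v) for v in itertools.product([False, True], repeat=4))
print(round(total, 10))  # 1.0
```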
Example Bayesian Network
- The "ALARM" network for monitoring intensive care patients
- 37 variables, 509 parameters (full joint: 2^37)
[Figure: the ALARM network structure, with nodes such as PULMEMBOLUS, PAP, MINVOLSET, KINKEDTUBE, INTUBATION, SHUNT, VENTMACH, VENTLUNG, PRESS, MINVOL, ANAPHYLAXIS, SAO2, TPR, HYPOVOLEMIA, LVEDVOLUME, CVP, PCWP, LVFAILURE, STROKEVOLUME, FIO2, VENTALV, PVSAT, ARTCO2, EXPCO2, INSUFFANESTH, CATECHOL, HISTORY, CO, BP, HR, HREKG, HRBP, HRSAT, DISCONNECT, ERRCAUTER]
Application: Clustering Users
- Input: TV shows that each user watches
- Output: TV show "clusters"
- Assumption: shows watched by the same users are similar

Class 1: Power Rangers, Animaniacs, X-Men, Tazmania, Spider-Man
Class 2: Young and the Restless, Bold and the Beautiful, As the World Turns, Price is Right, CBS Evening News
Class 3: Tonight Show, Conan O'Brien, NBC Nightly News, Later with Kinnear, Seinfeld
Class 4: 60 Minutes, NBC Nightly News, CBS Evening News, Murder She Wrote, Matlock
Class 5: Seinfeld, Friends, Mad About You, ER, Frasier
Application: Recommendation Systems
- Given user preferences, suggest recommendations
- Example: Amazon.com
- Input: movie preferences of many users
- Solution: model correlations between movie features
  - Users that like comedy often like drama
  - Users that like action often do not like cartoons
  - Users that like Robert De Niro films often like Al Pacino films
- Given user preferences, can predict the probability that new movies match those preferences
Diagnostic Systems
- Diagnostic indexing for the Home Health site at Microsoft
- Enter symptoms → recommend multimedia content
Online Troubleshooters
Application: Finding Regulatory Networks
- Model: P(Level | Module, Regulators)
- Expression level in each module is a function of the expression of its regulators
[Figure: Bayesian network relating Gene ("what module does gene g belong to?"), Experiment, Module, regulator expression levels in each experiment (e.g., HAP4, CMK1, BMH1, GIC2), and Expression Level]
Application: Finding Regulatory Networks
[Figure: inferred regulatory network. Regulators (transcription factors such as Gat1, Hap4, Cbf1, Msn4, Xbp1, Gis1, Yap6; signaling molecules such as Sip2, Kin82, Tpk1, Tpk2, Cmk1, Ppt1, Bmh1, Gac1, Gcn20, Not3) connect to numbered modules; annotations mark experimentally tested regulators, enriched cis-regulatory motifs (e.g., GATA, STRE, HSF, HAP234, MIG1, CAT8, ADR1, GCN4, GCR1, XBP1, HAC1, ABF_C, MCM1, REPCAR), regulation supported in the literature, and inferred regulation; module groups include energy and cAMP signaling, DNA and RNA processing, and amino acid metabolism]
Prerequisites
- Probability theory
  - Conditional probabilities
  - Joint distribution
  - Random variables
- Information theory
- Function optimization
- Graph theory
- Computational complexity
Probability Theory
- A probability distribution P over (Ω, S) is a mapping from events in S such that:
  - P(α) ≥ 0 for all α ∈ S
  - P(Ω) = 1
  - If α, β ∈ S and α ∩ β = ∅, then P(α ∪ β) = P(α) + P(β)
- Conditional probability: P(α | β) = P(α ∩ β) / P(β)
- Chain rule: P(α ∩ β) = P(α) P(β | α)
- Bayes rule: P(α | β) = P(β | α) P(α) / P(β)
- Conditional independence: (α ⊥ β | γ) if P(α | β ∩ γ) = P(α | γ)
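A tiny numerical check of these identities (the joint below is hypothetical, chosen only for illustration):

```python
# Hypothetical joint over two events alpha and beta (and their complements).
P = {("a", "b"): 0.12, ("a", "~b"): 0.18,
     ("~a", "b"): 0.28, ("~a", "~b"): 0.42}

P_a = P[("a", "b")] + P[("a", "~b")]   # P(alpha) = 0.30
P_b = P[("a", "b")] + P[("~a", "b")]   # P(beta)  = 0.40
P_b_given_a = P[("a", "b")] / P_a      # P(beta | alpha)
P_a_given_b = P[("a", "b")] / P_b      # P(alpha | beta)

# Chain rule: P(alpha, beta) = P(alpha) * P(beta | alpha)
assert abs(P[("a", "b")] - P_a * P_b_given_a) < 1e-12
# Bayes rule: P(alpha | beta) = P(beta | alpha) * P(alpha) / P(beta)
assert abs(P_a_given_b - P_b_given_a * P_a / P_b) < 1e-12
```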
Random Variables & Notation
- Random variable: function from Ω to a value (categorical / ordinal / continuous)
- Val(X): set of possible values of RV X
- Upper case letters denote RVs (e.g., X, Y, Z)
- Upper case bold letters denote sets of RVs (e.g., X, Y)
- Lower case letters denote RV values (e.g., x, y, z)
- Lower case bold letters denote RV set values (e.g., x)
- Values for categorical RVs with |Val(X)| = k: x1, x2, ..., xk
- Marginal distribution over X: P(X)
- Conditional independence: X is independent of Y given Z if P(X | Y, Z) = P(X | Z)
Expectation
- Discrete RVs: E_P[X] = Σ_x x · P(x)
- Continuous RVs: E_P[X] = ∫ x · p(x) dx
- Linearity of expectation: E[X + Y] = E[X] + E[Y]
- Expectation of products: E[X · Y] = E[X] · E[Y] (when X ⊥ Y in P; independence assumption)
Variance
- Variance of an RV: Var[X] = E[(X - E[X])²] = E[X²] - (E[X])²
- If X and Y are independent: Var[X + Y] = Var[X] + Var[Y]
- Var[aX + b] = a² Var[X]
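A short Monte Carlo sanity check of the expectation and variance identities above (NumPy; the two distributions are arbitrary choices, and independence holds because the samples are drawn separately):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1_000_000)   # samples of X
y = rng.exponential(scale=3.0, size=1_000_000)       # samples of Y, independent of X

# Linearity of expectation (holds with or without independence).
print(np.mean(x + y), np.mean(x) + np.mean(y))

# For independent X, Y: E[XY] = E[X]E[Y] and Var[X+Y] = Var[X] + Var[Y].
print(np.mean(x * y), np.mean(x) * np.mean(y))
print(np.var(x + y), np.var(x) + np.var(y))

# Var[aX + b] = a^2 Var[X]
a, b = 3.0, 7.0
print(np.var(a * x + b), a ** 2 * np.var(x))
```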
Information Theory
- Entropy: H_P(X) = -Σ_x P(x) log P(x)
  - We use log base 2 to interpret entropy as bits of information
  - The entropy of X is a lower bound on the average number of bits needed to encode values of X
  - 0 ≤ H_P(X) ≤ log|Val(X)| for any distribution P(X)
- Conditional entropy: H_P(X | Y) = H_P(X, Y) - H_P(Y)
  - Information only helps: H_P(X | Y) ≤ H_P(X)
- Mutual information: I_P(X; Y) = H_P(X) - H_P(X | Y)
  - 0 ≤ I_P(X; Y) ≤ H_P(X)
  - Symmetry: I_P(X; Y) = I_P(Y; X)
  - I_P(X; Y) = 0 iff X and Y are independent
- Chain rule of entropies: H_P(X, Y) = H_P(X) + H_P(Y | X)
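These quantities are easy to compute directly; a small sketch (NumPy, with a hypothetical 2x2 joint) checking the conditional-entropy identity and the mutual-information bounds:

```python
import numpy as np

def entropy(p):
    """H(p) in bits for a flat array of probabilities summing to 1."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Hypothetical joint P(X, Y): rows index x, columns index y.
P_xy = np.array([[0.30, 0.10],
                 [0.20, 0.40]])
P_x, P_y = P_xy.sum(axis=1), P_xy.sum(axis=0)

H_x = entropy(P_x)
H_x_given_y = entropy(P_xy.ravel()) - entropy(P_y)   # H(X|Y) = H(X,Y) - H(Y)
I_xy = H_x - H_x_given_y                              # I(X;Y) = H(X) - H(X|Y)

print(H_x, H_x_given_y, I_xy)
assert 0 <= I_xy <= H_x + 1e-12          # 0 <= I(X;Y) <= H(X)
assert H_x <= np.log2(len(P_x)) + 1e-12  # H(X) <= log|Val(X)|
```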
Distances Between Distributions
- Relative entropy: D(P‖Q) = Σ_x P(x) log [P(x) / Q(x)]
  - D(P‖Q) ≥ 0
  - D(P‖Q) = 0 iff P = Q
  - Not a distance metric (no symmetry, no triangle inequality)
- L1 distance: ‖P - Q‖₁ = Σ_x |P(x) - Q(x)|
- L2 distance: ‖P - Q‖₂ = √(Σ_x (P(x) - Q(x))²)
- L∞ distance: ‖P - Q‖∞ = max_x |P(x) - Q(x)|
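A minimal sketch (NumPy, hypothetical distributions) computing these quantities; note that D(P‖Q) and D(Q‖P) generally differ:

```python
import numpy as np

def kl(p, q):
    """Relative entropy D(P||Q) in bits; assumes q > 0 wherever p > 0."""
    mask = p > 0
    return float((p[mask] * np.log2(p[mask] / q[mask])).sum())

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

print(kl(p, q), kl(q, p))              # both >= 0, not symmetric
print(np.abs(p - q).sum())             # L1 distance
print(np.sqrt(((p - q) ** 2).sum()))   # L2 distance
print(np.abs(p - q).max())             # L-infinity distance
```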
Optimization Theory
- Find values θ1, ..., θn that maximize an objective f(θ1, ..., θn)
- Optimization strategies:
  - Solve the gradient equation ∇f = 0 analytically and verify a local maximum
  - Gradient search: guess initial values and improve iteratively
    - Gradient ascent
    - Line search
    - Conjugate gradient
- Lagrange multipliers
  - Solve maximization problems with constraints g_i(θ) = 0
  - Maximize the Lagrangian f(θ) + Σ_i λ_i g_i(θ)
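A bare-bones gradient ascent sketch on a hypothetical concave objective (not any particular model from the course):

```python
import numpy as np

# Maximize f(theta) = -(theta_1 - 3)^2 - 2 * (theta_2 + 1)^2.
def grad(theta):
    return np.array([-2.0 * (theta[0] - 3.0), -4.0 * (theta[1] + 1.0)])

theta = np.zeros(2)      # initial guess
step = 0.1               # fixed step size (a line search would adapt this)
for _ in range(200):
    theta = theta + step * grad(theta)

print(theta)             # approaches the analytic maximum at (3, -1)
```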
Graph Theory
- Undirected graph
- Directed graph
- Complete graph (every two nodes connected)
- Acyclic graph
- Partially directed acyclic graph (PDAG)
- Induced graph
- Sub-graph
- Graph algorithms
  - Shortest path from node X1 to all other nodes (BFS)
Representing Joint Distributions
- Random variables: X1, ..., Xn
- P is a joint distribution over X1, ..., Xn
- If X1, ..., Xn are binary, we need 2^n parameters to describe P
- Can we represent P more compactly?
  - Key: exploit independence properties
Independent Random Variables
- Two variables X and Y are independent if:
  - P(X=x | Y=y) = P(X=x) for all values x, y
  - Equivalently, knowing Y does not change predictions of X
- If X and Y are independent then:
  - P(X, Y) = P(X | Y) P(Y) = P(X) P(Y)
- If X1, ..., Xn are independent then:
  - P(X1, ..., Xn) = P(X1) ... P(Xn)
  - O(n) parameters
  - All 2^n probabilities are implicitly defined
  - Cannot represent many types of distributions
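A minimal sketch of the savings (plain Python; the marginals are hypothetical): n numbers implicitly define all 2^n joint probabilities when the variables are independent:

```python
import itertools
import math

p = [0.1, 0.6, 0.3, 0.8]   # hypothetical P(X_i = 1) for 4 independent binary RVs

def joint(assignment):
    """P(X_1=a_1, ..., X_n=a_n) = product of the marginals, by independence."""
    return math.prod(p[i] if a else 1 - p[i] for i, a in enumerate(assignment))

# 4 parameters define all 2^4 = 16 joint entries (and they sum to 1).
total = sum(joint(a) for a in itertools.product([0, 1], repeat=len(p)))
print(round(total, 10))   # 1.0
```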
Conditional Independence
- X and Y are conditionally independent given Z if:
  - P(X=x | Y=y, Z=z) = P(X=x | Z=z) for all values x, y, z
  - Equivalently, if we know Z, then knowing Y does not change predictions of X
- Notation: Ind(X; Y | Z) or (X ⊥ Y | Z)
Conditional Parameterization
- S = score on test, Val(S) = {s0, s1}
- I = intelligence, Val(I) = {i0, i1}

Joint parameterization P(I, S):
  I  S  | P(I, S)
  i0 s0 | 0.665
  i0 s1 | 0.035
  i1 s0 | 0.06
  i1 s1 | 0.24

Conditional parameterization P(I), P(S | I):
  P(I): i0 0.7, i1 0.3
  P(S | I): i0 row: s0 0.95, s1 0.05; i1 row: s0 0.2, s1 0.8

- Both use 3 parameters
- Alternative parameterization: P(S) and P(I | S)
Conditional Parameterization
- S = score on test, Val(S) = {s0, s1}
- I = intelligence, Val(I) = {i0, i1}
- G = grade, Val(G) = {g0, g1, g2}
- Assume that G and S are independent given I
- Joint parameterization:
  - 2 × 2 × 3 = 12 entries, i.e., 12 - 1 = 11 independent parameters
- Conditional parameterization:
  - P(I, S, G) = P(I) P(S|I) P(G|I, S) = P(I) P(S|I) P(G|I)
  - P(I): 1 independent parameter
  - P(S|I): 2 × 1 independent parameters
  - P(G|I): 2 × 2 independent parameters
  - Total: 7 independent parameters
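A small sketch of this factorization (plain Python; P(I) and P(S|I) are the numbers from the previous slide, while the P(G|I) table is made up purely for illustration):

```python
import itertools

P_I = {"i0": 0.7, "i1": 0.3}
P_S = {"i0": {"s0": 0.95, "s1": 0.05},
       "i1": {"s0": 0.20, "s1": 0.80}}
P_G = {"i0": {"g0": 0.2, "g1": 0.5, "g2": 0.3},   # hypothetical values
       "i1": {"g0": 0.7, "g1": 0.2, "g2": 0.1}}   # hypothetical values

def joint(i, s, g):
    """P(I, S, G) = P(I) P(S|I) P(G|I), using (G independent of S given I)."""
    return P_I[i] * P_S[i][s] * P_G[i][g]

# 1 + 2 + 4 = 7 independent parameters define all 12 joint entries.
total = sum(joint(i, s, g)
            for i, s, g in itertools.product(P_I, ("s0", "s1"), ("g0", "g1", "g2")))
print(round(total, 10))   # 1.0
```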
Naïve Bayes Model
- Class variable C, Val(C) = {c1, ..., ck}
- Evidence variables X1, ..., Xn
- Naïve Bayes assumption: evidence variables are conditionally independent given C
  - P(C, X1, ..., Xn) = P(C) Π_i P(Xi | C)
- Applications in medical diagnosis, text classification
- Used as a classifier: P(C = c | x1, ..., xn) ∝ P(c) Π_i P(xi | c)
- Problem: double counting of correlated evidence
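A minimal naïve Bayes classifier sketch (plain Python; the class names, priors, and likelihood tables are hypothetical):

```python
import math

def posterior(priors, likelihoods, evidence):
    """P(C=c | x_1..x_n) proportional to P(c) * prod_i P(x_i | c), computed in log space."""
    log_scores = {c: math.log(priors[c]) +
                     sum(math.log(likelihoods[c][i][x]) for i, x in enumerate(evidence))
                  for c in priors}
    m = max(log_scores.values())
    unnorm = {c: math.exp(s - m) for c, s in log_scores.items()}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}

# Hypothetical 2-class problem with 2 binary evidence variables.
priors = {"c1": 0.4, "c2": 0.6}
likelihoods = {"c1": [{1: 0.8, 0: 0.2}, {1: 0.6, 0: 0.4}],
               "c2": [{1: 0.1, 0: 0.9}, {1: 0.3, 0: 0.7}]}
print(posterior(priors, likelihoods, evidence=(1, 1)))
```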
Bayesian Network (Informal)
- Directed acyclic graph G
  - Nodes represent random variables
  - Edges represent direct influences between random variables
- Local probability models
[Figures: Example 1: I → S; Example 2: S ← I → G; Naïve Bayes: C → X1, C → X2, ..., C → Xn]
Bayesian Network (Informal)
- Represents a joint distribution
  - Specifies the probability P(X=x)
  - Specifies the conditional probability P(X=x | E=e)
- Allows for reasoning patterns:
  - Prediction (e.g., intelligent → high scores)
  - Explanation (e.g., low score → not intelligent)
  - Explaining away (different causes for the same effect interact)
[Figure: Example 2 network over I, S, G]
Bayesian Network Structure
- Directed acyclic graph G
  - Nodes X1, ..., Xn represent random variables
- G encodes local Markov assumptions:
  - Xi is independent of its non-descendants given its parents
  - Formally: (Xi ⊥ NonDesc(Xi) | Pa(Xi))
- Example: (E ⊥ {A, C, D, F} | B)
[Figure: a DAG over nodes A, B, C, D, E, F, G with E a child of B]
Independency Mappings (I-Maps)
- Let P be a distribution over X
- Let I(P) be the set of independencies (X ⊥ Y | Z) that hold in P
- A Bayesian network structure G is an I-map (independency mapping) of P if I(G) ⊆ I(P)

Example: two graphs over I, S: the disconnected graph, with I(G) = {I ⊥ S}, and the graph I → S, with I(G) = ∅

  P1(I, S):         P2(I, S):
  i0 s0 | 0.25      i0 s0 | 0.4
  i0 s1 | 0.25      i0 s1 | 0.3
  i1 s0 | 0.25      i1 s0 | 0.2
  i1 s1 | 0.25      i1 s1 | 0.1

- I(P1) = {I ⊥ S}: both graphs are I-maps of P1
- I(P2) = ∅: only the graph I → S is an I-map of P2
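A quick numerical check of the two example distributions (NumPy; the joints are exactly the tables above):

```python
import numpy as np

def independent(P_xy, tol=1e-9):
    """True if X and Y are independent in the joint P_xy (rows: x, cols: y)."""
    P_x = P_xy.sum(axis=1, keepdims=True)
    P_y = P_xy.sum(axis=0, keepdims=True)
    return bool(np.allclose(P_xy, P_x * P_y, atol=tol))

P1 = np.array([[0.25, 0.25], [0.25, 0.25]])   # left table
P2 = np.array([[0.40, 0.30], [0.20, 0.10]])   # right table

print(independent(P1))  # True:  (I ⊥ S) holds in P1, so both graphs are I-maps of P1
print(independent(P2))  # False: only the graph I -> S (with I(G) empty) is an I-map of P2
```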
Factorization Theorem
- If G is an I-map of P, then P(X1, ..., Xn) = Π_i P(Xi | Pa(Xi))
- Proof:
  - WLOG, X1, ..., Xn is an ordering consistent with G
  - By the chain rule: P(X1, ..., Xn) = Π_i P(Xi | X1, ..., Xi-1)
  - From the ordering: {X1, ..., Xi-1} ⊆ NonDesc(Xi)
  - Since G is an I-map, (Xi ⊥ NonDesc(Xi) | Pa(Xi)) ∈ I(P), so P(Xi | X1, ..., Xi-1) = P(Xi | Pa(Xi))
Factorization Implies I-Map
- If P factorizes over G, i.e., P(X1, ..., Xn) = Π_i P(Xi | Pa(Xi)), then G is an I-map of P
- Proof:
  - Need to show (Xi ⊥ NonDesc(Xi) | Pa(Xi)) ∈ I(P), or that P(Xi | NonDesc(Xi)) = P(Xi | Pa(Xi))
  - WLOG, X1, ..., Xn is an ordering consistent with G
Bayesian Network Definition
- A Bayesian network is a pair (G, P) where:
  - P factorizes over G
  - P is specified as a set of CPDs associated with G's nodes
- Parameters (n binary variables):
  - Joint distribution: 2^n
  - Bayesian network (bounded in-degree k): n · 2^k
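A one-line illustration of the parameter savings (hypothetical sizes):

```python
n, k = 20, 3          # hypothetical: 20 binary variables, at most 3 parents each
print(2 ** n - 1)     # full joint: 1,048,575 independent parameters
print(n * 2 ** k)     # bounded in-degree Bayesian network: at most 160 parameters
```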
Bayesian Network Design
- Variable considerations:
  - Clarity test: can an omniscient being determine its value?
  - Hidden variables?
  - Irrelevant variables
- Structure considerations:
  - Causal order of variables
  - Which independencies (approximately) hold?
- Probability considerations:
  - Zero probabilities
  - Orders of magnitude
  - Relative values