Advances in Bayesian Learning and Inference in Bayesian Networks
Irina Rish
IBM T. J. Watson Research Center
rish@us.ibm.com
“Road map”
- Introduction and motivation: what are Bayesian networks and why use them?
- How to use them: probabilistic inference
- How to learn them: learning parameters, learning graph structure
- Summary
Bayesian Networks
[Figure: a network over Smoking, Lung Cancer, Bronchitis, X-ray, Dyspnoea; Smoking points to Lung Cancer and Bronchitis; Lung Cancer and Smoking point to X-ray; Lung Cancer and Bronchitis point to Dyspnoea.]
P(lung cancer = yes | smoking = no, dyspnoea = yes) = ?
Bayesian Networks: Representation
P(S, C, B, X, D) = P(S) P(C|S) P(B|S) P(X|C,S) P(D|C,B)
Each node stores a conditional probability distribution (CPD) given its parents, e.g. P(D|C,B):

  C B | D=0  D=1
  0 0 | 0.1  0.9
  0 1 | 0.7  0.3
  1 0 | 0.8  0.2
  1 1 | 0.9  0.1

Conditional independencies yield an efficient representation.
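A minimal sketch of this representation in plain Python. The P(D|C,B) numbers are the ones from the table above; the remaining CPDs are made-up placeholders for illustration:

```python
# Bayesian network over binary variables S, C, B, X, D.
# Each CPD maps a tuple of parent values to a distribution over the child.

P_S = {(): {0: 0.8, 1: 0.2}}                  # P(S)      -- placeholder numbers
P_C = {(0,): {0: 0.95, 1: 0.05},              # P(C|S)    -- placeholder numbers
       (1,): {0: 0.80, 1: 0.20}}
P_B = {(0,): {0: 0.9, 1: 0.1},                # P(B|S)    -- placeholder numbers
       (1,): {0: 0.6, 1: 0.4}}
P_X = {(0, 0): {0: 0.9, 1: 0.1},              # P(X|C,S)  -- placeholder numbers
       (0, 1): {0: 0.8, 1: 0.2},
       (1, 0): {0: 0.3, 1: 0.7},
       (1, 1): {0: 0.2, 1: 0.8}}
P_D = {(0, 0): {0: 0.1, 1: 0.9},              # P(D|C,B)  -- from the CPD table
       (0, 1): {0: 0.7, 1: 0.3},
       (1, 0): {0: 0.8, 1: 0.2},
       (1, 1): {0: 0.9, 1: 0.1}}

def joint(s, c, b, x, d):
    """P(S,C,B,X,D) = P(S) P(C|S) P(B|S) P(X|C,S) P(D|C,B)."""
    return (P_S[()][s] * P_C[(s,)][c] * P_B[(s,)][b]
            * P_X[(c, s)][x] * P_D[(c, b)][d])
```

The five CPDs have 13 free parameters instead of the 31 needed for the full joint over five binary variables; the savings grow exponentially with the number of variables.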
What are they good for?
- Diagnosis: P(cause | symptom) = ?
- Prediction: P(symptom | cause) = ?
- Classification: P(class | data) = ?
- Decision-making (given a cost function)
Application areas: medicine, bioinformatics, speech recognition, stock market, text classification, computer troubleshooting.
Example: Printer Troubleshooting
Bayesian networks: inference
P(X | evidence) = ? For example, with evidence d=1:

  P(s | d=1) ∝ Σ_c Σ_b Σ_x P(s) P(c|s) P(b|s) P(x|c,s) P(d=1|c,b)
             = P(s) Σ_b P(b|s) Σ_c P(c|s) P(d=1|c,b) Σ_x P(x|c,s)

Variable elimination pushes each sum inside the product, working on the "moral" graph. Complexity: exponential in w* = 4, the "induced width" (max clique size).
Efficient inference: variable orderings, conditioning, approximations.
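For intuition, here is the query P(s | d=1) answered by brute-force enumeration over the joint, reusing the CPD tables and `joint` function from the earlier sketch. Variable elimination returns the same answer but pushes each sum inside the product, reducing the cost from exponential in the number of variables to exponential in w*:

```python
from itertools import product

def query_S(evidence):
    """P(S | evidence) by enumeration: sum the joint over all assignments
    consistent with the evidence, then normalize over S."""
    unnorm = {0: 0.0, 1: 0.0}
    for s, c, b, x, d in product((0, 1), repeat=5):
        assign = {'S': s, 'C': c, 'B': b, 'X': x, 'D': d}
        if all(assign[v] == val for v, val in evidence.items()):
            unnorm[s] += joint(s, c, b, x, d)
    z = unnorm[0] + unnorm[1]
    return {s: p / z for s, p in unnorm.items()}

print(query_S({'D': 1}))    # P(s | d=1)
```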
“Road map”
✓ Introduction and motivation: what are Bayesian networks and why use them?
✓ How to use them: probabilistic inference
- Why and how to learn them: learning parameters, learning graph structure
- Summary
Why learn Bayesian networks?
- Combining domain expert knowledge with data, e.g. records like
  <9.7 0.6 8 14 18>
  <0.2 1.3 5 ? ?>
  <1.3 2.8 ? ? 0 1>
  <? ? 5.6 0 10 ? ?>
  …
- Efficient representation and inference
- Incremental learning: updating P(H) as new data arrive
- Handling missing data, e.g. <1.3 2.8 ? ? 0 1>
- Learning causal relationships, e.g. S → C
Learning Bayesian Networks
- Known graph: learn the parameters P(S), P(C|S), P(B|S), P(X|C,S), P(D|C,B)
  - Complete data: parameter estimation (ML, MAP)
  - Incomplete data: non-linear parametric optimization (gradient descent, EM)
- Unknown graph: learn both graph and parameters
  - Complete data: optimization (search in the space of graphs)
  - Incomplete data: structural EM, mixture models
Learning Parameters: complete data
- ML-estimate: maximize log P(Data | Θ). For multinomial CPDs the likelihood is decomposable, so each parameter is a ratio of counts:
  θ̂_{x|pa(x)} = N(x, pa(x)) / N(pa(x))
- MAP-estimate (Bayesian statistics): maximize P(Data | Θ) P(Θ) with conjugate (Dirichlet) priors, which simply add pseudo-counts α to the data counts; the equivalent sample size expresses the strength of the prior knowledge:
  θ̂_{x|pa(x)} = (N(x, pa(x)) + α(x, pa(x))) / (N(pa(x)) + α(pa(x)))
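A sketch of count-based estimation, assuming discrete variables and records stored as Python dicts (the function name and data layout are illustrative, not from the slides):

```python
from collections import Counter

def estimate_cpd(records, child, parents, arity=2, alpha=1.0):
    """Estimate P(child | parents) from complete data.
    alpha=0 gives the ML estimate N(x, pa)/N(pa); alpha>0 adds Dirichlet
    pseudo-counts, i.e. a MAP-style smoothed estimate whose equivalent
    sample size is alpha*arity per parent configuration."""
    counts = Counter((tuple(r[p] for p in parents), r[child]) for r in records)
    cpd = {}
    for pa in {pa for pa, _ in counts}:
        n_pa = sum(counts[(pa, v)] for v in range(arity))
        cpd[pa] = {v: (counts[(pa, v)] + alpha) / (n_pa + alpha * arity)
                   for v in range(arity)}
    return cpd

# e.g. P(D | C, B) from complete records:
#   cpd_D = estimate_cpd(records, child='D', parents=('C', 'B'))
```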
Learning Parameters: incomplete data
- Marginal likelihood is non-decomposable (hidden nodes)
- EM-algorithm, iterated until convergence from some initial parameters:
  - Expectation: run inference in the current model to fill in the missing values, e.g. P(S | X=0, D=1, C=0, B=1), yielding expected counts
  - Maximization: update the parameters from the expected counts (ML, MAP)

Data with missing values (columns S X D C B):
  <? 0 1 0 1>
  <1 1 ? 0 1>
  <0 0 0 ? ?>
  <? ? 0 ? 1>
  …
After the E-step each record is completed in expectation; the M-step then re-estimates the parameters from the expected counts, and the two steps alternate.
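A minimal EM sketch for the simplest interesting case: a two-node network S → D where S is sometimes missing. Variable names, initialization, and data layout are illustrative assumptions, not from the slides:

```python
def em_binary(data, iters=50):
    """EM for S -> D with binary variables; data is a list of (s, d) pairs
    where s may be None (missing) and d is always observed.
    Returns (P(S=1), {s: P(D=1 | S=s)})."""
    p_s, p_d = 0.5, {0: 0.4, 1: 0.6}            # asymmetric init breaks ties
    for _ in range(iters):
        # E-step: expected counts; a missing S is filled in by P(S | D=d)
        n_s = {0: 0.0, 1: 0.0}                  # expected count of S=s
        n_sd = {0: 0.0, 1: 0.0}                 # expected count of S=s, D=1
        for s, d in data:
            if s is None:
                lik = {v: p_d[v] if d else 1.0 - p_d[v] for v in (0, 1)}
                z = p_s * lik[1] + (1.0 - p_s) * lik[0]
                w = {1: p_s * lik[1] / z, 0: (1.0 - p_s) * lik[0] / z}
            else:
                w = {s: 1.0, 1 - s: 0.0}
            for v in (0, 1):
                n_s[v] += w[v]
                n_sd[v] += w[v] * d
        # M-step: ML update of the parameters from the expected counts
        p_s = n_s[1] / len(data)
        p_d = {v: n_sd[v] / n_s[v] for v in (0, 1)}
    return p_s, p_d

# e.g. em_binary([(1, 1), (None, 1), (0, 0), (None, 0), (1, 1)])
```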
Learning graph structure
- Find the graph G maximizing a score over the data: NP-hard optimization
- Heuristic search (a sketch follows below):
  - Greedy local search with moves add S→B, delete S→B, reverse S→B
  - Best-first search
  - Simulated annealing
- Complete data: scores allow local computations; incomplete data (score non-decomposable): structural EM
- Constraint-based methods: the data impose independence relations (constraints) that select among candidate structures, e.g. alternative graphs over S, B, C
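A sketch of the greedy local search with the three moves named above. `score` stands in for any structure score, such as the MDL/BIC score on the next slide; representing graphs as sets of directed edges is an assumption of this sketch:

```python
from itertools import permutations

def acyclic(edges, nodes):
    """Cycle check: repeatedly peel off nodes with no incoming edges."""
    rem, e = set(nodes), set(edges)
    while rem:
        sources = [n for n in rem if not any(v == n for _, v in e)]
        if not sources:
            return False                # every remaining node lies on a cycle
        rem -= set(sources)
        e = {(u, v) for u, v in e if u in rem and v in rem}
    return True

def neighbors(edges, nodes):
    """All DAGs one local move away: add, delete, or reverse one edge."""
    for u, v in permutations(nodes, 2):
        if (u, v) in edges:
            yield edges - {(u, v)}                       # delete u -> v
            rev = (edges - {(u, v)}) | {(v, u)}
            if acyclic(rev, nodes):
                yield rev                                # reverse u -> v
        elif (v, u) not in edges:
            cand = edges | {(u, v)}
            if acyclic(cand, nodes):
                yield cand                               # add u -> v

def greedy_search(nodes, score, max_steps=1000):
    """Hill-climb from the empty graph until no move improves the score."""
    g, best = frozenset(), score(frozenset())
    for _ in range(max_steps):
        cand = max(neighbors(g, nodes), key=score, default=None)
        if cand is None or score(cand) <= best:
            break                                        # local optimum
        g, best = cand, score(cand)
    return g
```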
Scoring functions: Minimum Description Length (MDL)
- Learning as data compression: MDL(Model) = DL(Data | Model) + DL(Model)
- Other scores:
  - MDL = -BIC (Bayesian Information Criterion)
  - Bayesian score (BDe), asymptotically equivalent to MDL
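A sketch of the BIC score (so that MDL = -BIC), using the same record layout as the earlier estimation sketch: the log-likelihood term plays the role of DL(Data | Model) and the parameter penalty the role of DL(Model). The `structure` argument is a dict from each child to its parent tuple, e.g. the slide network {'S': (), 'C': ('S',), 'B': ('S',), 'X': ('C', 'S'), 'D': ('C', 'B')}:

```python
from collections import Counter
from math import log

def bic_score(records, structure, arity=2):
    """BIC(G) = log-likelihood(Data | ML params) - (log N / 2) * #free params."""
    n = len(records)
    ll, n_params = 0.0, 0
    for child, parents in structure.items():
        counts = Counter((tuple(r[p] for p in parents), r[child])
                         for r in records)
        pa_tot = Counter(tuple(r[p] for p in parents) for r in records)
        # log-likelihood contribution of this family, at its ML parameters
        ll += sum(c * log(c / pa_tot[pa]) for (pa, _), c in counts.items())
        # (arity - 1) free parameters per parent configuration
        n_params += (arity - 1) * arity ** len(parents)
    return ll - (log(n) / 2) * n_params
```

Plugged into `greedy_search` above (after converting an edge set to a child-to-parents dict), this yields a complete, if toy, MDL-based structure learner.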
Summary
- Bayesian networks: graphical probabilistic models
- Efficient representation and inference
- Expert knowledge + learning from data
- Learning:
  - parameters (parameter estimation, EM)
  - structure (optimization with scoring functions, e.g. MDL)
- Applications/systems: collaborative filtering (MSBN), fraud detection (AT&T), classification (AutoClass (NASA), TANBLT (SRI))
- Future directions: causality, time, model evaluation criteria, approximate inference/learning, on-line learning, etc.