Bayesian Kernel Methods for Binary Classification and Online Learning Problems. Theodore Trafalis, Industrial Engineering, College of Engineering. Workshop on Clustering and Search Techniques in Large Scale Networks, LATNA, Nizhny Novgorod, Russia, November 4, 2014
Part I. Bayesian Kernel Methods for Gaussian Processes
Why Bayesian Learning? • Returns a probability rather than a hard label • Combines the power of kernel methods with the advantages of Bayesian updating • Can incorporate prior knowledge into the estimation • Can “learn” fairly quickly when the latent function is a Gaussian process • Can be used for regression or classification
Outline 1. Bayesics 2. Relevance Vector Machine 3. Laplace Approximation 4. Results
Bayes’ Rule • Posterior ∝ Likelihood × Prior (the posterior is the calculated value)
Logistic Likelihood
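The slide's equation did not survive; a minimal sketch of the standard logistic likelihood for labels yi ∈ {0, 1} with latent value ti = t(xi), presumably what was shown:

```latex
p(y_i \mid t_i) = \sigma(t_i)^{\,y_i}\,\bigl(1 - \sigma(t_i)\bigr)^{1 - y_i},
\qquad
\sigma(t) = \frac{1}{1 + e^{-t}}
```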
Prior • Assume t(x) = {t(x1), …, t(xm)} is a Gaussian process (jointly normally distributed) • Let t = Kα
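The Gaussian-process prior on t can be sketched numerically. The RBF kernel, its width, and the grid of inputs below are illustrative assumptions, not the talk's actual choices:

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    # Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 50)[:, None]       # hypothetical 1-D inputs
K = rbf_kernel(X)
# draw a latent function t ~ N(0, K); the jitter keeps the Cholesky stable
t = np.linalg.cholesky(K + 1e-8 * np.eye(len(X))) @ rng.standard_normal(len(X))
```

Drawing several such vectors t gives smooth random functions, which is what the Gaussian-process assumption buys.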
Maximize Posterior Goal: Find optimal values for α
Minimize Negative Log Posterior • The likelihood contributes one indicator term that is active when yi = 0 and another that is active when yi = 1
Relevance Vector Machine • Combines the Bayesian approach with the sparseness of the support vector machine • Hyperparameter si = 1/Var(αi)
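The RVM prior structure referred to here is presumably the standard one (Tipping): a zero-mean Gaussian on each weight with its own precision hyperparameter si = 1/Var(αi), together with a Gamma hyperprior:

```latex
p(\alpha \mid s) = \prod_{i=1}^{m} \mathcal{N}\!\left(\alpha_i \,\middle|\, 0,\; s_i^{-1}\right),
\qquad
p(s_i) = \mathrm{Gamma}(s_i \mid a, b)
```

Letting a = b ≈ 0 recovers the flat hyperprior of the next slide, and weights whose precision si grows without bound are pruned, which is the source of the sparseness.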
Non-Informative (Flat) Prior Let a = b ≈ 0
Maximize Posterior
Laplace Approximation • Newton–Raphson Method
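A minimal sketch of the Newton–Raphson search for the posterior mode around which the Laplace approximation is built, written in the standard GP-classification form (minimize the negative log-likelihood plus ½ tᵀK⁻¹t over the latent vector t); the kernel, jitter, and toy data are illustrative assumptions, not the speaker's exact formulation:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def laplace_mode(K, y, n_iter=20):
    """Newton-Raphson iterations for the mode of the logistic-GP posterior
    (labels y in {0, 1}). The objective is convex in t, so this converges."""
    n = len(y)
    t = np.zeros(n)
    for _ in range(n_iter):
        pi = sigmoid(t)
        W = pi * (1 - pi)                  # Hessian of the negative log-likelihood
        grad = y - pi                      # gradient of the log-likelihood
        # Newton step: t <- (K^{-1} + W)^{-1} (W t + grad)
        A = np.linalg.inv(K) + np.diag(W)
        t = np.linalg.solve(A, W * t + grad)
    return t

# toy demo: two well-separated clusters (hypothetical data)
X = np.array([[-2.0], [-1.8], [1.9], [2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
d2 = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
K = np.exp(-d2) + 1e-6 * np.eye(4)        # RBF Gram matrix with jitter
t_map = laplace_mode(K, y)
```

At the mode, sigmoid(t_map) is above ½ for the y = 1 points and below ½ for the y = 0 points; the Laplace approximation then places a Gaussian at t_map with covariance given by the inverse Hessian.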
Iteration
Optimizing the Hyperparameter • Need a closed-form expression for the posterior • If α | y, s were normally distributed, the optimum would have a closed form • Use a Gaussian approximation
SVM and RVM Comparison • Similar accuracy with fewer “support” (relevance) vectors
Conclusion • Posterior ∝ Likelihood × Prior • Gaussian process ▫ Makes the math easier ▫ Assumes that the density is centered around the mode • Relevance Vector Machine ▫ Similar accuracy to the Support Vector Machine ▫ Fewer data points retained by the RVM than by the SVM • In Part II we discuss ▫ A non-Gaussian process ▫ A Markov Chain Monte Carlo solution
References • B. Schölkopf and A. J. Smola, 2002. “Chapter 16: Bayesian Kernel Methods.” Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Cambridge: MIT Press. • C. M. Bishop and M. E. Tipping, 2003. “Bayesian Regression and Classification.” In J. Suykens, G. Horvath, S. Basu, C. Micchelli, and J. Vandewalle, eds., Advances in Learning Theory: Methods, Models and Applications. Amsterdam: IOS Press.
Backup
Likelihood for Classification • Logistic • Probit
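The two likelihoods named on this slide, in their standard forms (a sketch of what was presumably shown):

```latex
p(y = 1 \mid t) = \sigma(t) = \frac{1}{1 + e^{-t}} \quad \text{(logistic)},
\qquad
p(y = 1 \mid t) = \Phi(t) = \int_{-\infty}^{t} \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}\, dz \quad \text{(probit)}
```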
Likelihood for Regression
RVM for Regression where
Incremental Updating for Regression
Part II. Bayesian Kernel Methods Using Beta Distributions. Theodore Trafalis
Summary of Part I • Bayesian method: Posterior ∝ Likelihood × Prior • Gaussian process ▫ Makes the math easier ▫ Assumes that the density is centered around the mode • Relevance Vector Machine • Solution concept: posterior maximization
Current Bayesian Kernel Methods • Combine Bayesian probability with kernel methods • n data points, m attributes per data point • X is an n × m matrix • y is an n × 1 vector of 0s and 1s • q(X) is a function of X used to predict y • Posterior ∝ Likelihood × Prior
Support Vector Machines (MacKenzie and Trafalis)
What’s new in part 2 • Beta distributions as priors • Adaptation of beta-binomial updating formula • Comparison of beta kernel classifiers with existing SVM classifiers • Online learning
Outline 1. Beta Distribution 2. Other Priors 3. Markov Chain Monte Carlo 4. Test Case
Likelihood • Logistic Likelihood • Bernoulli Likelihood
Beta Distribution Prior
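The beta density this slide presumably displayed, with its mean:

```latex
p(\theta \mid a, b) = \frac{\Gamma(a + b)}{\Gamma(a)\,\Gamma(b)}\,
\theta^{\,a-1} (1 - \theta)^{\,b-1},
\qquad
\mathbb{E}[\theta] = \frac{a}{a + b}, \quad 0 \le \theta \le 1
```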
Shape of the beta distribution
Beta-binomial conjugate • Prior: Beta(a, b) • Likelihood: binomial with n trials and k ones • Posterior: Beta(a + k, b + n − k)
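The conjugate update is simple enough to state in a few lines; the numbers below are illustrative:

```python
def beta_binomial_update(a, b, n, k):
    """Conjugate update: Beta(a, b) prior plus k ones in n Bernoulli
    trials gives a Beta(a + k, b + n - k) posterior."""
    return a + k, b + (n - k)

def beta_mean(a, b):
    # mean of a Beta(a, b) distribution
    return a / (a + b)

# uniform prior Beta(1, 1); observe 7 ones in 10 trials
a, b = beta_binomial_update(1.0, 1.0, 10, 7)
# posterior is Beta(8, 4), with mean 8/12
```

No integration is needed: conjugacy turns Bayesian updating into counting, which is what makes the method fast.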
α and β • Let αi and βi be functions of xi
Applying beta-binomial to data mining • Prior • Posterior: depends on the number of zeros in the training set and on a parameter to be tuned
Classification Rule
Testing on data sets • Beta prior is uniform: a = 1, b = 1 • Rates represent mean values of the percent of ones or zeros correctly classified
Online learning • Each trial uses 100 data points to update the prior • Updated probabilities for one data point from the tornado data (y = 0 and y = 1)
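The online scheme, in which each batch's posterior becomes the next batch's prior, can be sketched as follows (the batch contents are hypothetical, not the tornado data):

```python
def online_beta_update(a, b, batches):
    """Sequential beta-binomial updating: each batch of 0/1 labels
    updates the current Beta(a, b), and the posterior becomes the
    prior for the next batch."""
    means = []
    for batch in batches:
        k, n = sum(batch), len(batch)
        a, b = a + k, b + (n - k)
        means.append(a / (a + b))      # posterior mean after this batch
    return a, b, means

# three small batches of labels (the talk used 100 points per trial)
batches = [[1, 0, 1, 1], [0, 0, 1, 0], [1, 1, 1, 0]]
a, b, means = online_beta_update(1.0, 1.0, batches)
```

Because the beta parameters simply accumulate counts, processing the batches one at a time gives exactly the same posterior as processing all the data at once.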
Conclusions • Adapting the beta-binomial updating rule to a kernel-based classifier can create a fast and accurate data mining algorithm • The user can set the prior and weights to reflect imbalanced data sets • Results are comparable to a weighted SVM • Online learning combines previous and current information
Options for Prior Distributions • α and β must be greater than 0 • Some prior choices make α and β independent
Kernel Function
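The kernel itself is not recoverable from this slide; a common choice in this setting, and presumably what was shown, is the RBF kernel:

```latex
k(x, x') = \exp\!\left(-\gamma\,\lVert x - x' \rVert^2\right), \qquad \gamma > 0
```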
Directed Acyclic Graph (nodes: μα, σα, μβ, σβ, s, r, x, γ, K−, K+, α, β, θ)
Markov Chain Monte Carlo (MCMC) • Simulation tool used for calculating posterior distributions • Gibbs Sampler: iterates using conditional distributions • Software ▫ Bayesian Inference Using Gibbs Sampling (BUGS) ▫ Just Another Gibbs Sampler (JAGS)
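A Gibbs sampler in its simplest form illustrates the iterate-over-conditionals idea; here for a bivariate normal with correlation ρ rather than the talk's beta model, since the bivariate normal has the textbook conditionals x | y ~ N(ρy, 1 − ρ²) and y | x ~ N(ρx, 1 − ρ²):

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_samples=20000, burn_in=1000, seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho:
    alternately draw x | y ~ N(rho*y, 1-rho^2) and y | x ~ N(rho*x, 1-rho^2),
    discarding an initial burn-in period."""
    rng = np.random.default_rng(seed)
    sd = np.sqrt(1.0 - rho**2)
    x = y = 0.0
    samples = np.empty((n_samples, 2))
    for i in range(burn_in + n_samples):
        x = rng.normal(rho * y, sd)    # draw from p(x | y)
        y = rng.normal(rho * x, sd)    # draw from p(y | x)
        if i >= burn_in:
            samples[i - burn_in] = (x, y)
    return samples

samples = gibbs_bivariate_normal(0.8)
```

The empirical correlation of the retained samples approaches ρ; BUGS and JAGS automate the same scheme by deriving each full conditional from the model specification.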
Toy Example
Parameters for Priors
Large Gamma Needed
Results
Test Data Automatically Calculated
Comparison
Conclusion • Advantages of the Beta-Bayesian Method ▫ Incorporates a non-Gaussian process ▫ Results on the example equal the SVM ▫ Testing data automatically calculated with MCMC • Disadvantages ▫ MCMC is a slow algorithm ▫ An analytical solution may not be possible ▫ Difficult to determine prior distributions • Future Work ▫ Real data ▫ More comparisons with existing methods
References • C. A. MacKenzie, T. B. Trafalis, and K. Barker, “A Bayesian Beta Kernel Model for Binary Classification and Online Learning Problems,” Statistical Analysis and Data Mining, in press, 2014.