SpicyMKL: Efficient multiple kernel learning method using dual augmented Lagrangian


SpicyMKL: Efficient multiple kernel learning method using dual augmented Lagrangian
Taiji Suzuki, Ryota Tomioka
The University of Tokyo, Graduate School of Information Science and Technology, Department of Mathematical Informatics
23rd Oct. 2009 @ ISM

Multiple Kernel Learning
Fast Algorithm: SpicyMKL

Kernel Learning: Regression, Classification with a Kernel Function

• Gaussian, polynomial, chi-square, ...
  – Parameters: Gaussian width, polynomial degree
• Thousands of kernels
  – Features in Computer Vision: color, gradient, sift (sift, hsvsift, huesift, scaling of sift), Geometric Blur, image regions, ...
Multiple Kernel Learning (Lanckriet et al. 2004): Select important kernels and combine them

Multiple Kernel Learning (MKL)
Single Kernel vs. Multiple Kernel: the combined kernel is a convex combination of the candidate kernels, k(x, x') = Σ_m d_m k_m(x, x') with d_m ≥ 0 and Σ_m d_m = 1; the weights d_m are sparse (many 0 components).
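As an illustration only (not taken from the slides), a combined Gram matrix under this convex-combination model could be formed as follows; gram_matrices and d are assumed to be precomputed base Gram matrices and their weights:

```python
import numpy as np

def combine_kernels(gram_matrices, d):
    """Convex combination of precomputed Gram matrices: K = sum_m d[m] * K_m."""
    d = np.asarray(d, dtype=float)
    assert np.all(d >= 0) and np.isclose(d.sum(), 1.0), "weights must lie on the simplex"
    K = np.zeros_like(gram_matrices[0], dtype=float)
    for weight, K_m in zip(d, gram_matrices):
        if weight > 0.0:  # sparse weights: zero components contribute nothing
            K += weight * K_m
    return K
```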

Sparse learning: Lasso → (grouping) → Group Lasso → (kernelize) → Multiple Kernel Learning (MKL)
Notation: N samples, M kernels; L1-regularization → sparse; the RKHS associated with k_m; convex loss (hinge, square, logistic)
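For reference, a standard block-L1 MKL problem matching this description (notation mine, not necessarily the exact form on the slide) is

```latex
\min_{f_m \in \mathcal{H}_m,\ b \in \mathbb{R}}\ \sum_{i=1}^{N} \ell\Bigl(y_i,\ \sum_{m=1}^{M} f_m(x_i) + b\Bigr) \;+\; C \sum_{m=1}^{M} \|f_m\|_{\mathcal{H}_m},
```

where ℓ is a convex loss and the sum of RKHS norms plays the role of an L1 penalty across kernels, inducing sparsity.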

Representer theorem: the problem reduces to a finite-dimensional one in terms of the Gram matrix of each kernel.
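A standard finite-dimensional form obtained from the representer theorem (my reconstruction, writing K_m for the Gram matrix of the m-th kernel and f_m(·) = Σ_i α_{m,i} k_m(x_i, ·)) is

```latex
f_m(x_i) = (K_m \alpha_m)_i, \qquad \|f_m\|_{\mathcal{H}_m} = \sqrt{\alpha_m^{\top} K_m \alpha_m},
```

so the problem becomes

```latex
\min_{\alpha_1,\dots,\alpha_M \in \mathbb{R}^{N},\ b \in \mathbb{R}}\ \sum_{i=1}^{N} \ell\Bigl(y_i,\ \sum_{m=1}^{M} (K_m \alpha_m)_i + b\Bigr) \;+\; C \sum_{m=1}^{M} \sqrt{\alpha_m^{\top} K_m \alpha_m}.
```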

Proposed: SpicyMKL
• Scales well against # of kernels.
• Formulated in a general framework.

Outline
• Introduction
• Details of Multiple Kernel Learning
• Our method
  – Proximal minimization
  – Skipping inactive kernels
• Numerical Experiments
  – Benchmark datasets
  – Sparse or dense

Details of Multiple Kernel Learning (MKL)

Generalized Formulation: MKL with a sparse regularization term

Examples. Losses: hinge loss, logistic loss. Regularizations: L1 regularization, elastic net. (Formulas below.)
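Standard definitions of these choices (my reconstruction, assuming labels y ∈ {-1, +1} and decision value z) are

```latex
\text{hinge: } \ell(y, z) = \max(0,\ 1 - yz), \qquad
\text{logistic: } \ell(y, z) = \log\bigl(1 + e^{-yz}\bigr), \qquad
\text{L1: } \phi(f) = C \sum_{m=1}^{M} \|f_m\|_{\mathcal{H}_m};
```

the elastic net adds a squared-norm term to the L1 penalty (a formula is given near the sparse-vs-dense comparison later in the deck).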

Rank 1 decomposition: If g is monotonically increasing and f_ℓ is diff'ble at the optimum, the solution of MKL is rank 1, i.e., it is obtained from a convex combination of kernels. Proof: take the derivative w.r.t. the relevant variable.

Existing approach 1: constraint-based formulation [Lanckriet et al. 2004, JMLR], [Bach, Lanckriet & Jordan 2004, ICML]
(primal and dual problems)
• Lanckriet et al. 2004: SDP
• Bach, Lanckriet & Jordan 2004: SOCP

Existing approach 2: upper-bound-based formulation [Sonnenburg et al. 2006, JMLR], [Rakotomamonjy et al. 2008, JMLR], [Chapelle & Rakotomamonjy 2008, NIPS workshop]
(primal; bound via Jensen's inequality)
• Sonnenburg et al. 2006: cutting plane (SILP)
• Rakotomamonjy et al. 2008: gradient descent (SimpleMKL)
• Chapelle & Rakotomamonjy 2008: Newton method (HessianMKL)
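The standard variational identity behind these methods (my reconstruction, not necessarily the exact bound shown on the slide) is

```latex
\Bigl(\sum_{m=1}^{M} \|f_m\|\Bigr)^{2} \;=\; \min_{d_m \ge 0,\ \sum_m d_m = 1}\ \sum_{m=1}^{M} \frac{\|f_m\|^{2}}{d_m},
```

which allows alternating between updating the kernel weights d and solving a single-kernel problem with k = Σ_m d_m k_m.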

Problem of existing methods
• Do not make use of sparsity during the optimization.
→ Do not perform efficiently when # of kernels is large.

Outline
• Introduction
• Details of Multiple Kernel Learning
• Our method
  – Proximal minimization
  – Skipping inactive kernels
• Numerical Experiments
  – Benchmark datasets
  – Sparse or dense

Our method: SpicyMKL

• DAL (Tomioka & Sugiyama 09)
  – Dual Augmented Lagrangian
  – Lasso, group Lasso, trace norm regularization
• SpicyMKL = DAL + MKL
  – Kernelization of DAL

Difficulty of MKL (Lasso): the regularizer is non-diff'ble at 0. Relax by proximal minimization (Rockafellar, 1976).

Our algorithm SpicyMKL: Proximal Minimization (Rockafellar, 1976)
1. Set some initial solution and choose an increasing sequence of proximity parameters.
2. Iterate the following update rule until convergence: minimize the objective plus a proximal term (regularization toward the last update).
It can be shown that the iterates converge, and do so super-linearly. Taking the dual, each update step can be solved efficiently.
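As a sketch only (not the authors' implementation), the outer loop could look like the following; objective and prox_step are hypothetical callables, the latter standing in for the dual Newton solve described on the next slides:

```python
def proximal_minimization(objective, prox_step, alpha0, gammas, tol=1e-6):
    """Outer loop of proximal minimization (Rockafellar, 1976), schematically.

    objective : callable alpha -> objective value (loss + sparse regularizer)
    prox_step : callable (alpha_prev, gamma) -> argmin_a { objective(a)
                + (1/(2*gamma)) * ||a - alpha_prev||^2 }, solved via its dual
    gammas    : increasing sequence of proximity parameters
    """
    alpha = alpha0
    prev = objective(alpha)
    for gamma in gammas:
        # Regularization toward the last update keeps the subproblem well-behaved
        alpha = prox_step(alpha, gamma)
        cur = objective(alpha)
        if abs(prev - cur) <= tol * (1.0 + abs(prev)):
            break  # converged
        prev = cur
    return alpha
```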

Fenchel's duality theorem: split the objective into two parts.

Non-diff'ble in the primal; in the dual, the non-diff'ble term turns into its Moreau envelope, which is smooth!

Inner optimization
• Newton method is applicable: the dual objective is twice differentiable (a.e.).
• Even if the loss is not differentiable, we can apply almost the same algorithm.

Rigorous derivation: convolution and duality

Moreau envelope (Moreau 65): For a convex function, its Moreau envelope is defined as below.
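The standard definition (notation mine, with proximity parameter γ > 0) is

```latex
\tilde{f}_{\gamma}(x) \;=\; \min_{y}\ \Bigl\{\, f(y) \;+\; \frac{1}{2\gamma}\,\|x - y\|^{2} \,\Bigr\},
```

which is differentiable in x even when f is not, with gradient ∇f̃_γ(x) = (x − prox_{γf}(x)) / γ.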

Moreau envelope of the L1 penalty appears in the dual, and it is smooth!
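For the scalar penalty λ|x|, the Moreau envelope is the Huber function (a standard fact, reconstructed here rather than copied from the slide):

```latex
\widetilde{(\lambda|\cdot|)}_{\gamma}(x) \;=\;
\begin{cases}
\dfrac{x^{2}}{2\gamma}, & |x| \le \gamma\lambda,\\[1ex]
\lambda |x| - \dfrac{\gamma\lambda^{2}}{2}, & |x| > \gamma\lambda,
\end{cases}
```

which is differentiable everywhere; the corresponding proximal operator is the soft threshold.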

What we have observed: we arrived at the dual of the update rule in proximal minimization.
• Differentiable
• Only active kernels need to be calculated. (next slide)

Skipping inactive kernels (soft threshold / hard threshold): if a kernel is inactive, its contribution is zero and its gradient also vanishes.
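A minimal sketch of the skipping idea (not the authors' code), assuming each kernel enters the dual objective through a block soft threshold of some vector v:

```python
import numpy as np

def block_soft_threshold(v, thresh):
    """Return the block soft threshold of v, or None if the block is inactive."""
    norm = np.linalg.norm(v)
    if norm <= thresh:
        return None  # inactive kernel: contribution is exactly 0
    return (1.0 - thresh / norm) * v

def accumulate_active(blocks, thresh):
    """Sum squared norms over active blocks only; inactive kernels are skipped,
    so neither their value nor their gradient needs to be computed."""
    total, active = 0.0, []
    for m, v in enumerate(blocks):
        s = block_soft_threshold(v, thresh)
        if s is None:
            continue
        active.append(m)
        total += np.linalg.norm(s) ** 2
    return total, active
```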

Derivatives
• We can apply the Newton method to the dual optimization.
• The objective function and its derivatives: only active kernels contribute to their computation.
→ Efficient for a large number of kernels.

Convergence result
Thm. For any proper convex loss and regularizer, and any non-decreasing sequence of proximity parameters, the iterates converge. If the loss is strongly convex and the regularizer is a sparse regularizer, then for any increasing sequence the convergence rate is super-linear.
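For reference, super-linear convergence of the iterates α^(t) to the optimum α* means

```latex
\lim_{t\to\infty}\ \frac{\|\alpha^{(t+1)} - \alpha^{*}\|}{\|\alpha^{(t)} - \alpha^{*}\|} \;=\; 0 .
```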

Non-differentiable loss
• If the loss is non-differentiable, almost the same algorithm is available by utilizing the Moreau envelope of the loss.
primal → dual → Moreau envelope of the dual

Dual of elastic net: smooth! Naive optimization in the dual is applicable, but we applied the proximal minimization method when the parameter is small, to avoid numerical instability.

Why 'Spicy'?
• SpicyMKL = DAL + MKL.
• DAL is also the name of a major Indian dish → hot, spicy "Dal" → SpicyMKL (image from Wikipedia)

Outline
• Introduction
• Details of Multiple Kernel Learning
• Our method
  – Proximal minimization
  – Skipping inactive kernels
• Numerical Experiments
  – Benchmark datasets
  – Sparse or dense

Numerical Experiments

IDA data set, L1 regularization: CPU time against # of kernels (plots for Splice and Ringnorm).
• SpicyMKL scales well against # of kernels.

IDA data set, L1 regularization: CPU time against # of samples (plots for Splice and Ringnorm).
• Scaling against # of samples is almost the same as for the existing methods.

UCI datasets (# of kernels: 189, 243, 918, 891, 1647)

Comparison in elastic net (Sparse vs. Dense)

Sparse or Dense? Elastic net, with a mixing parameter between 0 (sparse) and 1 (dense). Compare performances of sparse and dense solutions in a computer vision task.
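One common parameterization consistent with the "0 = sparse, 1 = dense" axis, writing θ for the (assumed) mixing parameter, is

```latex
\phi_{\theta}(f) \;=\; C \sum_{m=1}^{M} \Bigl( (1 - \theta)\,\|f_m\|_{\mathcal{H}_m} \;+\; \frac{\theta}{2}\,\|f_m\|_{\mathcal{H}_m}^{2} \Bigr),
```

so θ = 0 recovers the sparse L1-type penalty and θ = 1 the dense squared penalty.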

Caltech 101 dataset (Fei-Fei et al., 2004)
• object recognition task
• anchor, ant, cannon, chair, cup
• 10 classification problems (10 = 5 × (5 - 1) / 2)
• # of kernels: 1760 = 4 (features) × 22 (regions) × 2 (kernel functions) × 10 (kernel parameters)
  – sift features [Lowe 99]: colorDescriptor [van de Sande et al. 10], hsvsift, sift (scale = 4 px, 8 px, automatic scale)
  – regions: whole region, 4 divisions, 16 divisions, spatial pyramid (computed histogram of features on each region)
  – kernel functions: Gaussian kernels, kernels with 10 parameters

Performances are averaged over all classification problems.
• The best performance of the sparse solution is comparable with the dense solution.
(Plot legend: dense, medium density.)

Conclusion and Future works
Conclusion
• We proposed a new MKL algorithm that is efficient when # of kernels is large.
  – proximal minimization
  – neglect 'inactive' kernels
• Medium density showed the best performance, but the sparse solution also works well.
Future work
• Second order update

• Technical report: T. Suzuki & R. Tomioka: SpicyMKL. arXiv: http://arxiv.org/abs/0909.5026
• DAL: R. Tomioka & M. Sugiyama: Dual Augmented Lagrangian Method for Efficient Sparse Reconstruction. IEEE Signal Processing Letters, 16(12), pp. 1067-1070, 2009.

Thank you for your attention!

Some properties of the Moreau envelope and the proximal operator (Fenchel duality theorem): differentiable!