Learning with Trees. Rob Nowak, University of Wisconsin-Madison. Collaborators: Rui Castro, Clay Scott, Rebecca Willett. www.ece.wisc.edu/~nowak Artwork: Piet Mondrian
Basic Problem: Partitioning. Many problems in statistical learning theory boil down to finding a good partition.
Classification. Learning and classification: build a decision rule based on labeled training data. The classification rule is a partition of the feature space.
Signal and Image Processing. MRI data: brain aneurysm; extracted vascular network. Recover complex geometrical structure from noisy data.
Partitioning Schemes Support Vector Machine image partitions
Why Trees? • Simplicity of design • Interpretability • Ease of implementation • Good performance in practice. Trees are one of the most popular and widely used machine learning / data analysis tools. CART: Breiman, Friedman, Olshen, and Stone, 1984, Classification and Regression Trees. C4.5: Quinlan, 1993, C4.5: Programs for Machine Learning. JPEG 2000: image compression standard, 2000, http://www.jpeg.org/jpeg2000/
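As a concrete illustration (a minimal sketch, not from the talk; scikit-learn's DecisionTreeClassifier, a CART-style learner, is used as a stand-in, and the data and depth cap are invented for the example):

```python
# Minimal sketch of tree-based classification on noisy 2-D data.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 2))              # features in [0,1]^2
y = (X[:, 0] + X[:, 1] > 1).astype(int)     # labels from a simple boundary
y ^= (rng.random(500) < 0.1)                # flip 10% of labels (noise)

# The depth cap controls how fine the induced partition is.
clf = DecisionTreeClassifier(max_depth=4).fit(X, y)
print("training error:", 1 - clf.score(X, y))
```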
Example: Gamma-Ray Burst Analysis. Photon counts vs. burst time. Compton Gamma-Ray Observatory, Burst and Transient Source Experiment (BATSE). One burst (tens of seconds) emits as much energy as our entire Milky Way does in one hundred years! X-ray “afterglow”.
Trees and Partitions (figure: coarse and fine partitions).
Estimation using Pruned Tree. Piecewise-constant fits to the data on each piece of the partition; each leaf corresponds to a sample f(t_i), i = 0, …, N-1. The pruned partition provides a good estimate.
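A minimal sketch of the idea (my reconstruction, not the talk's code; the test signal, noise level, and depth are illustrative):

```python
# Piecewise-constant estimation on a dyadic partition of [0,1].
import numpy as np

def dyadic_piecewise_constant(y, depth):
    """Fit the sample mean on each of 2**depth equal dyadic intervals;
    each interval plays the role of a leaf of the pruned tree."""
    cells = np.array_split(y, 2 ** depth)
    return np.concatenate([np.full(len(c), c.mean()) for c in cells])

n = 1024
t = np.arange(n) / n
f = np.where(t < 0.3, 1.0, np.where(t < 0.7, 3.0, 0.5))    # true signal
y = f + 0.4 * np.random.default_rng(1).standard_normal(n)  # noisy samples
fhat = dyadic_piecewise_constant(y, depth=5)               # 32 cells
print("MSE:", np.mean((fhat - f) ** 2))
```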
Gamma-Ray Burst 845: piecewise linear fit on each cell; piecewise polynomial fit on each cell.
Recursive Partitions
Adapted Partition
Image Denoising
Decision (Classification) Trees. Labeled training data; Bayes boundary; complete partition; pruned decision partition. Decision tree: majority vote at each leaf.
Classification. Ideal classifier; 256 cells in each partition; adapted partition; histogram.
Image Partitions 1024 cells in each partition
Image Coding. JPEG, 0.125 bpp, non-adaptive partitioning; JPEG 2000, 0.125 bpp, adaptive partitioning.
Probabilistic Framework
Prediction Problem
Challenge
Empirical Risk
Empirical Risk Minimization
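The formulas on these two slides did not survive extraction; in the standard notation this line of work uses, the empirical risk and its minimizer over a class F are (my reconstruction):

```latex
\hat{R}_n(f) = \frac{1}{n}\sum_{i=1}^{n} \ell\big(f(X_i), Y_i\big),
\qquad
\hat{f}_n = \arg\min_{f \in \mathcal{F}} \hat{R}_n(f),
```

with the 0/1 loss for classification and squared error for regression.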
Classification and Regression Trees
Classification and Regression Trees (figure: data points labeled 0/1).
Empirical Risk Minimization on Trees
Overfitting Problem (figure: a crude but stable fit vs. an accurate but variable fit).
Bias/Variance Trade-off. Coarse partition: large bias, small variance. Fine partition: small bias, large variance.
Estimation and Approximation Error
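In the usual decomposition (standard, not recovered from the slide), the excess risk splits into exactly the two terms named here:

```latex
R(\hat{f}_n) - R^* =
\underbrace{\Big(\inf_{f\in\mathcal{F}} R(f) - R^*\Big)}_{\text{approximation error (bias)}}
+ \underbrace{\Big(R(\hat{f}_n) - \inf_{f\in\mathcal{F}} R(f)\Big)}_{\text{estimation error (variance)}}.
```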
Estimation Error in Regression
Estimation Error in Classification
Partition Complexity and Overfitting (plot: risk vs. # leaves; the empirical risk can be trusted for coarse partitions, but as the number of leaves grows, variance grows and the empirical risk overfits the data).
Controlling Overfitting
Complexity Regularization
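In the standard form this line of work uses (my reconstruction; the slide's own formula was lost in extraction), the complexity-regularized estimator minimizes empirical risk plus a penalty on tree size:

```latex
\hat{f}_n = \arg\min_{T} \Big\{ \hat{R}_n(f_T) + \mathrm{pen}(T) \Big\},
```

where f_T is the best fit on the partition induced by tree T and pen(T) grows with the number of leaves |T|.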
Per-Cell Variance Bounds: Regression
Per-Cell Variance Bounds: Classification
Variance Bounds
A Slightly Weaker Variance Bound
Complexity Regularization. “Small” leaves contribute very little to the penalty.
Example: Image Denoising. This is a special case of “wavelet denoising” using the Haar wavelet basis.
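A hand-rolled 1-D sketch of Haar-wavelet denoising by hard thresholding (my reconstruction, not the talk's code; the function name and threshold rule are illustrative):

```python
import numpy as np

def haar_denoise(y, thresh):
    """Forward 1-D Haar transform, hard-threshold the detail
    coefficients, then invert.  len(y) must be a power of two."""
    approx = np.asarray(y, dtype=float)
    details = []
    while len(approx) > 1:
        even, odd = approx[0::2], approx[1::2]
        details.append((even - odd) / np.sqrt(2))   # fine-scale differences
        approx = (even + odd) / np.sqrt(2)          # coarse-scale averages
    # Hard thresholding: analogous to pruning leaves that contribute little.
    details = [np.where(np.abs(d) > thresh, d, 0.0) for d in details]
    for d in reversed(details):
        out = np.empty(2 * len(approx))
        out[0::2] = (approx + d) / np.sqrt(2)
        out[1::2] = (approx - d) / np.sqrt(2)
        approx = out
    return approx

# Typical threshold: on the order of sigma * sqrt(2 * log(n)).
```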
Theory of Complexity Regularization
Classification (figure: face with labeled features: eyes, mustache).
Probabilistic Framework
Learning from Data (figure: labeled data, classes 0 and 1).
Approximation and Estimation: approximation error (BIAS); model selection (VARIANCE).
Classifier Approximations (figure: classes 0 and 1).
Approximation Error. The error is measured over the symmetric difference set.
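The standard identity presumably behind this slide relates excess risk to the symmetric difference with the Bayes decision set G* (here η(x) = P(Y = 1 | X = x) and Δ denotes symmetric difference):

```latex
R(f) - R^* = \int_{G_f \,\Delta\, G^*} \big|2\eta(x) - 1\big| \, dP_X(x),
\qquad G_f = \{x : f(x) = 1\}.
```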
Approximation Error: boundary smoothness; risk-functional (transition) smoothness.
Boundary Smoothness
Transition Smoothness
Transition Smoothness
Fundamental Limit to Learning Mammen & Tsybakov (1999)
Related Work
Box-Counting Class
Box-Counting Sub-Classes
Dyadic Decision Trees. Labeled training data; Bayes decision boundary; complete RDP; pruned RDP. Dyadic decision tree: majority vote at each leaf. Joint work with Clay Scott, 2004.
Dyadic Decision Trees
The Classifier Learning Problem. Training data: {(X_i, Y_i)}, i = 1, …, n, i.i.d. samples from an unknown distribution P_XY. Model class: a collection F of candidate classifiers. Problem: use the data to select a classifier in F whose risk is close to the Bayes risk.
Empirical Risk
Chernoff’s Bound
Chernoff’s Bound: the actual risk is probably not much larger than the empirical risk.
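One standard form (Hoeffding's version of the Chernoff bound), for a single fixed classifier f:

```latex
\mathbb{P}\big( R(f) > \hat{R}_n(f) + \epsilon \big) \le e^{-2n\epsilon^2},
```

so with probability at least 1 - δ, R(f) ≤ R̂_n(f) + sqrt(log(1/δ) / (2n)).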
Error Deviation Bounds
Uniform Deviation Bound
Setting Penalties
Setting Penalties. Prefix codes for trees; example code: 0001001111, plus 6 bits for leaf labels.
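A sketch of the counting behind such a code (hypothetical helper, not from the talk; the convention of 0 for an internal node and 1 for a leaf, in preorder, is one standard choice and may differ from the slide's):

```python
# Preorder prefix code for a binary tree: 0 = internal node, 1 = leaf.
# Decoding is unambiguous, so code length is a valid complexity measure.
def encode(tree):
    """tree is a 0/1 leaf label or a (left, right) pair of subtrees."""
    if isinstance(tree, tuple):
        left, right = tree
        return "0" + encode(left) + encode(right)
    return "1"

t = ((0, 1), 1)                 # a tree with three leaves
print(encode(t))                # -> "00111": 5 structure bits
# Total description length: structure bits + one label bit per leaf.
```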
Uniform Deviation Bound
Decision Tree Selection. Compare with: Oracle Bound: Approximation Error (Bias) + Estimation Error (Variance).
Rate of Convergence. BUT… why is this rate too slow?
Balanced vs. Unbalanced Trees: same number of leaves; all |T|-leaf trees are equally favored.
Spatial Adaptation local error local empirical error
Relative Chernoff Bound
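One standard relative (multiplicative) form, for p̂ the empirical frequency over n samples of an event of probability p:

```latex
\mathbb{P}\big( \hat{p} \le (1-\gamma)\,p \big) \le e^{-n p \gamma^2 / 2},
\qquad 0 < \gamma < 1,
```

so the deviation scales with sqrt(p) rather than a fixed ε; small-probability (small-volume) cells get correspondingly small slack.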
Designing Leaf Penalties. Prefix code construction: 00 = “left branch”, 01 = “right branch”, 11 = “terminate”, 0/1 = “label”; example code: 010001110.
Uniform Deviation Bound. Compare with:
Spatial Adaptivity Key: local complexity is offset by small volumes!
Bound Comparison for Unbalanced Tree. Non-adaptive bound: J leaves, depth J-1. Adaptive bound:
Balanced vs. Unbalanced Trees: same number of leaves.
Decision Tree Selection Oracle Bound: Approximation Error Estimation Error
Rate of Convergence
Computable Penalty achieves same rate of convergence
Adapting to Dimension - Feature Rejection 1 0
Adapting to Dimension - Data Manifold
Computational Issues. Additive penalty. Cyclic DDT: force coordinate splits in cyclic order. Free-Split DDT: no ordering enforced on splits.
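A sketch (my reconstruction, not the talk's code) of why additivity matters computationally: the optimal pruning decomposes over subtrees, giving a bottom-up dynamic program.

```python
# Bottom-up pruning of a dyadic tree under an additive penalty.
# node = (cell_error, penalty, children); children is None or (left, right).
def prune(node):
    """Minimal cost of empirical error + penalty over all prunings."""
    err, pen, children = node
    leaf_cost = err + pen                  # cost of making this cell a leaf
    if children is None:
        return leaf_cost
    # Additivity decouples the two subtrees: solve each independently.
    split_cost = prune(children[0]) + prune(children[1])
    return min(leaf_cost, split_cost)

# Example: splitting wins only if it cuts error by more than the penalty.
leaf = lambda e: (e, 1.0, None)
root = (10.0, 1.0, (leaf(2.0), leaf(3.0)))
print(prune(root))                         # -> min(11.0, 3.0 + 4.0) = 7.0
```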
DDTs in Action
Comparison to State of the Art. ODCT = DDT + cross-validation. Best results: (1) AdaBoost with RBF network, (2) Kernel Fisher Discriminant, (3) SVM with RBF kernel.
Application to Level Set Estimation (figure panels: elevation map of St. Louis; level set; noisy data; thresholded data; penalty proportional to |T|; spatially adaptive penalty).
Conclusions and Future Work. Open Problem: More Info: www.ece.wisc.edu/~nowak/ece901