Prototype-based learning and adaptive distances for classification
Michael Biehl
Johann Bernoulli Institute for Mathematics and Computer Science
University of Groningen
www.cs.rug.nl/biehl

overview
Basic concepts of similarity / distance based classification
- example system: Learning Vector Quantization (LVQ)
- application: Classification of Adrenal Tumors
Distance measures and Relevance Learning
- predefined distances, e.g. divergence based LVQ
  application: Detection of Cassava Mosaic Disease
- adaptive distances, e.g. Matrix Relevance LVQ
  application: Classification of Adrenal Tumors (cont’d)
- extensions: combined distances, relational data
(excursion: uniqueness and regularization of relevance matrices)
Brain Inspired Computing, Cetraro, July 2013

Part I: Basic concepts of distance/similarity based classification

classification problems
here only: supervised learning, classification:
- character/digit/speech recognition
- medical diagnoses
- pixel-wise segmentation in image processing
- object recognition/scene analysis
- fault detection in technical systems
- remote sensing ...
machine learning approach:
extract information from example data, parameterized in a learning system (neural network, LVQ, SVM, ...)
working phase: application to novel data

distance based classification
assignment of data (objects, observations, ...) to one or several classes (crisp/soft) (categories, labels),
based on comparison with reference data (samples, prototypes) in terms of a distance measure (dis-similarity, metric)
representation of data (a key step!)
- collection of qualitative/quantitative descriptors
- vectors of numerical features
- sequences, graphs, functional data
- relational data, e.g. in terms of pairwise (dis-)similarities

K-NN classifier
a simple distance-based classifier
- store a set of labeled examples
- classify a query according to the label of the Nearest Neighbor (or the majority of K NN)
- local decision boundary according to (e.g.) Euclidean distances
- piece-wise linear class borders, parameterized by all examples
[figure: query point "?" among labeled examples in feature space]
+ conceptually simple, no training required, one parameter (K)
- expensive storage and computation, sensitivity to “outliers”, can result in overly complex decision boundaries
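The K-NN rule above can be sketched in a few lines. This is a minimal illustration, not code from the talk: the toy data, the function names and the choice of squared Euclidean distance are assumptions made for the example.

```python
from collections import Counter


def squared_euclidean(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def knn_classify(query, examples, labels, k=1):
    """Majority vote among the k stored examples closest to the query."""
    order = sorted(range(len(examples)),
                   key=lambda i: squared_euclidean(query, examples[i]))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Toy data, made up for illustration only
examples = [(0.0, 0.0), (0.2, 0.1), (1.0, 1.0), (0.9, 1.2), (0.1, 0.3)]
labels = ["-", "-", "+", "+", "-"]
print(knn_classify((0.8, 0.9), examples, labels, k=3))
```

Every query scans all stored examples, which is the storage and computation cost listed above.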

prototype based classification
a prototype based classifier [Kohonen 1990, 1997]
- represent the data by one or several prototypes per class
- classify a query according to the label of the nearest prototype (or alternative schemes)
- local decision boundaries according to (e.g.) Euclidean distances
- piece-wise linear class borders, parameterized by prototypes
[figure: query point "?" among prototypes in feature space]
+ less sensitive to outliers, lower storage needs, little computational effort in the working phase
- training phase required in order to place prototypes, model selection problem: number of prototypes per class, etc.

Nearest Prototype Classifier
set of prototypes carrying class-labels: $\{ w^j, c(w^j) \}_{j=1}^{M}$
nearest prototype classifier (NPC), based on dissimilarity/distance measure $d(w, x)$
given $x$:
- determine the winner $w^*$ with $d(w^*, x) \le d(w^j, x)$ for all $j$
- assign $x$ to the class $c(w^*)$
reasonable requirements: $d(w, x) \ge 0$, $d(w, w) = 0$
most prominent example: (squared) Euclidean distance $d(w, x) = (w - x)^2 = \sum_{n=1}^{N} (w_n - x_n)^2$
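A minimal sketch of the nearest prototype classifier with an exchangeable distance measure. The prototype positions, the labels "A"/"B" and the function names are hypothetical and only serve the illustration.

```python
def squared_euclidean(w, x):
    """Squared Euclidean distance d(w, x) = sum_n (w_n - x_n)^2."""
    return sum((wn - xn) ** 2 for wn, xn in zip(w, x))


def npc_classify(x, prototypes, proto_labels, distance=squared_euclidean):
    """Return the label of the winner, i.e. the prototype closest to x."""
    winner = min(range(len(prototypes)),
                 key=lambda j: distance(prototypes[j], x))
    return proto_labels[winner]


# Hypothetical prototypes, one per class
prototypes = [(0.0, 0.0), (2.0, 2.0)]
proto_labels = ["A", "B"]
print(npc_classify((0.5, 0.2), prototypes, proto_labels))
```

Any non-negative dissimilarity can be passed as `distance`; only the comparison of distances to the prototypes matters for the decision.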

Learning Vector Quantization
N-dimensional data, feature vectors $x \in \mathbb{R}^N$
∙ identification of prototype vectors from labeled example data
∙ distance based classification (e.g. Euclidean)
heuristic scheme: LVQ1 [Kohonen, 1990, 1997]
• initialize prototype vectors for different classes
• present a single example
• identify the winner (closest prototype)
• move the winner
  - closer towards the data (same class)
  - away from the data (different class)

Learning Vector Quantization
N-dimensional data, feature vectors $x \in \mathbb{R}^N$
∙ identification of prototype vectors from labeled example data
∙ distance-based classification [here: Euclidean distances]
∙ tessellation of feature space [piece-wise linear]
∙ aim: discrimination of classes (≠ vector quantization or density estimation)
∙ generalization ability: correct classification of new data

LVQ1
iterative training procedure:
- randomized initial prototypes $w^k$, e.g. close to the class-conditional means
- sequential presentation of labelled examples $(x, y)$
- ... the winner takes it all: $w^* = \arg\min_k d(w^k, x)$
LVQ1 update step [Kohonen, 1990, 1997]:
$w^* \leftarrow w^* + \eta_w \, \psi \, (x - w^*)$ with $\psi = +1$ if $c(w^*) = y$ and $\psi = -1$ otherwise
learning rate: $\eta_w$
many heuristic variants/modifications:
- learning rate schedules $\eta_w(t)$ [Darken & Moody, 1992]
- update more than one prototype per step
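The LVQ1 procedure can be sketched as follows, assuming squared Euclidean distance and a constant learning rate. Function names, the default values of `eta` and `epochs`, and the use of a seeded shuffle are illustrative choices, not part of the original scheme.

```python
import random


def squared_euclidean(w, x):
    return sum((wn - xn) ** 2 for wn, xn in zip(w, x))


def lvq1_step(x, y, prototypes, proto_labels, eta):
    """One LVQ1 update (in place): move the winner towards x if the
    labels coincide, away from x otherwise. Returns the winner index."""
    winner = min(range(len(prototypes)),
                 key=lambda j: squared_euclidean(prototypes[j], x))
    psi = 1.0 if proto_labels[winner] == y else -1.0
    prototypes[winner] = [wn + eta * psi * (xn - wn)
                          for wn, xn in zip(prototypes[winner], x)]
    return winner


def train_lvq1(data, labels, prototypes, proto_labels,
               eta=0.05, epochs=20, seed=0):
    """Sequential presentation of the examples in random order."""
    rng = random.Random(seed)
    prototypes = [list(w) for w in prototypes]  # work on a copy
    order = list(range(len(data)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            lvq1_step(data[i], labels[i], prototypes, proto_labels, eta)
    return prototypes
```

A learning rate schedule would replace the constant `eta` by a function of the step counter.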

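LVQ1
LVQ1 update step: $w^* \leftarrow w^* + \eta_w \, \psi \, (x - w^*)$ with $\psi = +1$ if the classes coincide and $\psi = -1$ otherwise
LVQ1-like update for generalized distance $d$: $w^* \leftarrow w^* - \eta_w \, \psi \, \tfrac{1}{2} \, \partial d(w^*, x) / \partial w^*$
(for the squared Euclidean distance this reduces to the LVQ1 step; the factor 1/2 only rescales the learning rate)
requirement: update decreases (increases) distance if classes coincide (are different)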

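Generalized LVQ
one example of cost function based training: GLVQ [Sato & Yamada, 1995]
minimize $E = \sum_{\mu=1}^{P} \Phi(e^\mu)$ with $e^\mu = \frac{d(w^J, x^\mu) - d(w^K, x^\mu)}{d(w^J, x^\mu) + d(w^K, x^\mu)}$
two winning prototypes: $w^J$ (closest prototype with the correct label), $w^K$ (closest prototype with a wrong label)
- linear $\Phi$, e.g. $\Phi(e) = e$: E favors large margin separation of classes
- sigmoidal $\Phi$ (linear for small arguments), e.g. $\Phi(e) = 1 / (1 + \exp(-\gamma e))$: E approximates number of misclassifications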

GLVQ
training = optimization with respect to prototype position, e.g. single example presentation, stochastic sequence of examples, update of two prototypes per step
based on non-negative, differentiable distance $d$, with $e = (d_J - d_K) / (d_J + d_K)$, $d_J = d(w^J, x)$, $d_K = d(w^K, x)$:
$w^J \leftarrow w^J - \eta \, \Phi'(e) \, \frac{2 d_K}{(d_J + d_K)^2} \, \frac{\partial d_J}{\partial w^J}$
$w^K \leftarrow w^K + \eta \, \Phi'(e) \, \frac{2 d_J}{(d_J + d_K)^2} \, \frac{\partial d_K}{\partial w^K}$

GLVQ
based on (squared) Euclidean distance, $\partial d / \partial w = -2 (x - w)$: moves prototypes towards / away from sample with prefactors
$w^J \leftarrow w^J + \eta \, \Phi'(e) \, \frac{4 d_K}{(d_J + d_K)^2} \, (x - w^J)$
$w^K \leftarrow w^K - \eta \, \Phi'(e) \, \frac{4 d_J}{(d_J + d_K)^2} \, (x - w^K)$
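A sketch of one stochastic GLVQ step for squared Euclidean distance, obtained by applying the chain rule to $\Phi\big((d_J - d_K)/(d_J + d_K)\big)$. The function name and the default `phi_prime` (derivative of the linear choice $\Phi(e) = e$) are assumptions for the illustration; the guard for $d_J + d_K = 0$ is an added safety measure.

```python
def squared_euclidean(w, x):
    return sum((wn - xn) ** 2 for wn, xn in zip(w, x))


def glvq_step(x, y, prototypes, proto_labels, eta,
              phi_prime=lambda e: 1.0):
    """One GLVQ update (in place) for squared Euclidean distance.

    Returns e = (dJ - dK) / (dJ + dK), evaluated before the update.
    Requires at least one prototype with label y and one without.
    """
    dists = [squared_euclidean(w, x) for w in prototypes]
    correct = [j for j, c in enumerate(proto_labels) if c == y]
    wrong = [j for j, c in enumerate(proto_labels) if c != y]
    J = min(correct, key=lambda j: dists[j])  # closest correct prototype
    K = min(wrong, key=lambda j: dists[j])    # closest wrong prototype
    dJ, dK = dists[J], dists[K]
    total = dJ + dK
    if total == 0.0:  # degenerate case, no update possible
        return 0.0
    e = (dJ - dK) / total
    g = phi_prime(e)
    pull = eta * g * 4.0 * dK / total ** 2  # prefactor for w^J
    push = eta * g * 4.0 * dJ / total ** 2  # prefactor for w^K
    prototypes[J] = [w + pull * (xn - w) for w, xn in zip(prototypes[J], x)]
    prototypes[K] = [w - push * (xn - w) for w, xn in zip(prototypes[K], x)]
    return e
```

For a sigmoidal $\Phi$, `phi_prime` would be replaced by the derivative of that sigmoid.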

prototype/distance based classifiers
+ intuitive interpretation: prototypes defined in feature space
+ natural for multi-class problems
+ flexible, easy to implement
+ frequently applied in a variety of practical problems
- often based on purely heuristic arguments ... or ... cost functions with unclear relation to classification error
- model/parameter selection (# of prototypes, learning rate, ...)
Important issue: which is the ‘right’ distance measure? simple Euclidean distance?
features may
- scale differently
- be of completely different nature
- be highly correlated / dependent ...

related schemes
Many variants of LVQ
- intuitive schemes: LVQ2.1, LVQ3, OLVQ, ...
- cost function based: RSLVQ (likelihood ratios)
Supervised Neural Gas (NG): many prototypes, rank based update
Supervised Self-Organizing Maps (SOM): neighborhood relations, topology preserving mapping
Radial Basis Function Networks (RBF): hidden units = centers (prototypes) with Gaussian activation

remark: the curse of dimension?
concentration of distances for large N:
“distance based methods are bound to fail in high dimensions” ???
LVQ:
- prototypes are not just random data points
- carefully selected low-noise representatives of the data
- distances of a given data point to prototypes are compared: projection to non-trivial low-dimensional subspace!
see also: [Ghosh et al., 2007, Witoelar et al., 2010]
models of LVQ training, analytical treatment in the limit $N \to \infty$:
successful training needs $\propto N$ training examples

Questions?

tumor classification
An example problem: classification of adrenal tumors
[Arlt et al., J. Clin. Endocrinology & Metabolism, 2011]
Petra Schneider, Han Stiekema, Michael Biehl
Johann Bernoulli Institute for Mathematics and Computer Science, University of Groningen
Wiebke Arlt, Angela Taylor, Dave J. Smith, Peter Nightingale, P. M. Stewart, C. H. L. Shackleton et al.
School of Medicine, Queen Elizabeth Hospital, University of Birmingham/UK (+ several centers in Europe)

tumor classification
∙ adrenal tumors are common (1-2%) and mostly found incidentally
∙ adrenocortical carcinomas (ACC) account for 2-11% of adrenal incidentalomas (ACA: adrenocortical adenomas)
∙ conventional diagnostic tools lack sensitivity and are labor and cost intensive (CT, MRI)
∙ idea: tumor classification based on steroid excretion profile
[figure: adrenal gland, www.ensat.org]

tumor classification
- urinary steroid excretion (24 hours)
- 32 potential biomarkers
- biochemistry imposes correlations, grouping of steroids

tumor classification
data set:
- 102 patients with benign ACA
- 45 patients with malignant ACC
[figure: color coded excretion values (log. scale, relative to healthy controls); axes: patient # vs. # steroid marker]

tumor classification
Generalized LVQ, training and performance evaluation
∙ data divided in 90% training and 10% test set
∙ determine prototypes by stochastic gradient descent: typical profiles (1 per class)
∙ employ Euclidean distance measure in the 32-dim. feature space
∙ apply classifier to test data, evaluate performance (error rates)
∙ repeat and average over many random splits
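The evaluation protocol above (random 90%/10% splits, averaged test error) can be sketched independently of the classifier. In this illustration the `fit` and `predict` callables are hypothetical placeholders; the class-mean classifier used in the usage example is only a stand-in and is not GLVQ training.

```python
import random


def repeated_holdout(data, labels, fit, predict,
                     n_splits=50, test_fraction=0.1, seed=0):
    """Average test error rate over random training/test splits."""
    rng = random.Random(seed)
    n = len(data)
    n_test = max(1, round(test_fraction * n))
    errors = []
    for _ in range(n_splits):
        idx = list(range(n))
        rng.shuffle(idx)
        test, train = idx[:n_test], idx[n_test:]
        model = fit([data[i] for i in train], [labels[i] for i in train])
        wrong = sum(predict(model, data[i]) != labels[i] for i in test)
        errors.append(wrong / n_test)
    return sum(errors) / len(errors)


# Stand-in classifier (assumption): one prototype per class at the class mean
def fit_class_means(xs, ys):
    model = {}
    for c in set(ys):
        members = [x for x, y in zip(xs, ys) if y == c]
        model[c] = [sum(col) / len(members) for col in zip(*members)]
    return model


def predict_nearest_mean(model, x):
    return min(model, key=lambda c: sum((m - xn) ** 2
                                        for m, xn in zip(model[c], x)))


# Made-up, well separated toy data
toy = [(0.1 * i,) for i in range(10)] + [(10.0 + 0.1 * i,) for i in range(10)]
toy_labels = ["a"] * 10 + ["b"] * 10
print(repeated_holdout(toy, toy_labels, fit_class_means, predict_nearest_mean))
```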

tumor classification
[figure: prototypes, steroid excretion in ACA/ACC]

tumor classification
∙ Receiver Operating Characteristics (ROC) [Fawcett, 2000], obtained by introducing a biased NPC
[figure: ROC curve, true positive rate (sensitivity) vs. false positive rate (1-specificity); diagonal: random guessing; Area under Curve (AUC)]
- lower left corner, all tumors classified as ACA: no false alarms, no true positives detected
- upper right corner, all tumors classified as ACC: all true positives detected, max. number of false alarms
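A sketch of how an ROC curve and its AUC can be computed from a biased classifier. It assumes, as one possible realization of the biased NPC, that each sample gets a score such as $d(w_{ACA}, x) - d(w_{ACC}, x)$ and is assigned to the positive class (ACC) when the score exceeds a threshold $\theta$; the scores in the usage lines are made up.

```python
def roc_curve(scores, is_positive):
    """ROC points (fpr, tpr) for all thresholds, from 'all negative'
    (0, 0) to 'all positive' (1, 1). Needs both classes to be present."""
    n_pos = sum(1 for p in is_positive if p)
    n_neg = len(is_positive) - n_pos
    ranked = sorted(zip(scores, is_positive), key=lambda t: -t[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        s = ranked[i][0]
        # lower the threshold below all samples sharing this score
        while i < len(ranked) and ranked[i][0] == s:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Made-up scores: larger means 'closer to the positive-class prototype'
scores = [0.9, 0.4, 0.6, 0.1]
is_positive = [True, True, False, False]
print(auc(roc_curve(scores, is_positive)))
```

An AUC of 0.5 corresponds to random guessing, 1.0 to perfect separation of the two classes.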

tumor classification
GLVQ performance: ROC characteristics (averaged over splits of the data set), AUC = 0.87

Questions?

Part II: distance measures and relevance learning

distance measures
fixed distance measures:
- select distance measures according to prior knowledge
- data driven choice in a preprocessing step
- determine prototypes for a given distance
- compare performance of various measures
example: divergence based LVQ

Relevance Matrix LVQ
generalized quadratic distance in LVQ [Schneider et al., 2009]:
$d_\Lambda(w, x) = (x - w)^\top \Lambda \, (x - w) = [\Omega (x - w)]^2$ with $\Lambda = \Omega^\top \Omega$
normalization: $\sum_i \Lambda_{ii} = 1$
variants:
- one global, several local, class-wise relevance matrices → piecewise quadratic decision boundaries
- diagonal matrices: single feature weights [Bojer et al., 2001] [Hammer et al., 2002]
- rectangular $\Omega$: discriminative low-dim. representation, e.g. for visualization [Bunte et al., 2012]
possible constraints: rank-control, sparsity, ...
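A small sketch of the generalized quadratic distance, assuming the parameterization $\Lambda = \Omega^\top \Omega$ with the trace of $\Lambda$ normalized to one. The helper names and the example matrix are made up; $\Omega$ is stored as a list of rows and may be rectangular.

```python
import math


def normalize_omega(omega):
    """Rescale Omega so that trace(Omega^T Omega), i.e. the sum of its
    squared entries, equals 1."""
    norm = math.sqrt(sum(o * o for row in omega for o in row))
    return [[o / norm for o in row] for row in omega]


def relevance_matrix(omega):
    """Lambda = Omega^T Omega (symmetric, positive semi-definite)."""
    n = len(omega[0])
    return [[sum(row[j] * row[k] for row in omega) for k in range(n)]
            for j in range(n)]


def quadratic_distance(w, x, omega):
    """d(w, x) = |Omega (x - w)|^2 = (x - w)^T Lambda (x - w)."""
    diff = [xn - wn for wn, xn in zip(w, x)]
    mapped = [sum(o * d for o, d in zip(row, diff)) for row in omega]
    return sum(m * m for m in mapped)


# Hypothetical 1 x 2 (rectangular) Omega
omega = normalize_omega([[3.0, 4.0]])
lam = relevance_matrix(omega)
print(lam[0][0] + lam[1][1])                          # trace of Lambda
print(quadratic_distance((0.0, 0.0), (1.0, 0.0), omega))
```

With a diagonal $\Omega$ the distance reduces to a weighted Euclidean distance with one weight per feature.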

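Relevance Matrix LVQ
optimization of prototypes and distance measure
Matrix-LVQ1, winner takes all (WTA): the winner $w^*$ and the matrix $\Omega$ are updated along the gradients of $d_\Lambda(w^*, x)$
$\Delta w^* \propto -\psi \, \partial d_\Lambda(w^*, x) / \partial w^*$, $\Delta \Omega \propto -\psi \, \partial d_\Lambda(w^*, x) / \partial \Omega$
with $\psi = +1$ if the classes coincide and $\psi = -1$ otherwise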

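Relevance Matrix LVQ
optimization of prototypes and distance measure
Generalized Matrix-LVQ (GMLVQ): gradients of the GLVQ cost function term $\Phi(e)$, $e = (d_J - d_K) / (d_J + d_K)$, evaluated with $d_\Lambda$
$\Delta w^{J,K} \propto -\partial \Phi(e) / \partial w^{J,K}$, $\Delta \Omega \propto -\partial \Phi(e) / \partial \Omega$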

heuristic interpretation
$d_\Lambda(w, x) = [\Omega (x - w)]^2$: standard Euclidean distance for linearly transformed features $\Omega x$
$\Lambda_{jj} = \sum_i \Omega_{ij}^2$ summarizes
- the contribution of the original dimension $j$
- the relevance of original features for the classification
interpretation assumes implicitly: features have equal order of magnitude, e.g. after z-score-transformation
$x_j \to (x_j - \langle x_j \rangle) / \sqrt{\langle x_j^2 \rangle - \langle x_j \rangle^2}$ (averages $\langle \cdot \rangle$ over data set)
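The z-score transformation mentioned above can be sketched as follows. The function names are made up, the standard deviation is taken as the population value (division by the number of samples), and mapping constant features to zero is an added assumption to avoid division by zero.

```python
import math


def zscore_fit(data):
    """Per-feature means and (population) standard deviations."""
    n = len(data)
    dims = range(len(data[0]))
    means = [sum(x[j] for x in data) / n for j in dims]
    stds = [math.sqrt(sum((x[j] - means[j]) ** 2 for x in data) / n)
            for j in dims]
    return means, stds


def zscore_apply(x, means, stds):
    """Shift to zero mean and scale to unit variance, feature by feature."""
    return [(xj - m) / s if s > 0 else 0.0
            for xj, m, s in zip(x, means, stds)]


# Made-up data with very different feature scales
data = [[1.0, 10.0], [3.0, 30.0]]
means, stds = zscore_fit(data)
print([zscore_apply(x, means, stds) for x in data])
```

When the classifier is evaluated on test data, one reasonable design choice is to compute `means` and `stds` from the training set only and apply them unchanged to the test set.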

Relevance Matrix LVQ
optimization of prototype positions and distance measure(s) in one training process (≠ pre-processing)
motivation:
improved performance
- weighting of features and pairs of features
simplified classification schemes
- elimination of non-informative, noisy features
- discriminative low-dimensional representation
insight into the data / classification problem
- identification of most discriminative features
- incorporation of prior knowledge (e.g. structure of Ω)

tumor classification (cont’d)
Generalized Matrix LVQ, ACC vs. ACA classification [Arlt et al., 2011] [Biehl et al., 2012]
∙ data divided in 90% training, 10% test set (z-score transformed)
∙ determine prototypes: typical profiles (1 per class)
∙ adaptive generalized quadratic distance measure parameterized by $\Omega$
∙ apply classifier to test data, evaluate performance (error rates, ROC)
∙ repeat and average over many random splits

tumor classification
Relevance matrix
[figure: diagonal and off-diagonal elements of the relevance matrix; fraction of runs (random splits) in which a steroid is rated among the 9 most relevant markers]
subset of 9 selected steroids ↔ technical realization (patented, University of Birmingham/UK)

tumor classification
[figure: diagonal and off-diagonal elements of the relevance matrix; ACA/ACC values of steroid 19]
discriminative, e.g. steroid 19

tumor classification
[figure: diagonal and off-diagonal elements of the relevance matrix; ACA/ACC values of steroid 8]
non-trivial role: steroid 8 among the most relevant!

tumor classification
[figure: ACA/ACC samples, marker 8 vs. marker 12]
weakly discriminative markers, highly discriminative combination of markers!

tumor classification
ROC characteristics: clear improvement due to adaptive distances
[figure: ROC curves, sensitivity vs. 1-specificity]
AUC:
- 0.87 Euclidean
- 0.93 diagonal rel. (GRLVQ)
- 0.97 full matrix (GMLVQ)

tumor classification
observation / theory: low rank of resulting relevance matrix, often: single relevant eigendirection
Stationarity of Matrix Relevance LVQ [M. Biehl, B. Hammer, F.-M. Schleif, T. Villmann, IJCNN 2015, in press]
intrinsic regularization: nominally ~ N×N adaptive parameters in Matrix LVQ reduce to ~ N effective degrees of freedom
low-dimensional representation facilitates, e.g., visualization of labeled data sets
[figure: eigenvalues in ACA/ACC classification]

tumor classification
visualization of the data set
[figure: ACA and ACC samples]

a multi-class example
classification of coffee samples based on hyperspectral data (256-dim. feature vectors) [U. Seiffert et al., IFF Magdeburg]
[figure: projection on first eigenvector vs. projection on second eigenvector, with prototypes]

related schemes
Linear Discriminant Analysis (LDA): one prototype per class + global matrix, different objective function!
Relevance Learning related schemes in supervised learning ...
- RBF Networks [Backhaus et al., 2012]
- Neighborhood Component Analysis [Goldberger et al., 2005]
- Large Margin Nearest Neighbor [Weinberger et al., 2006, 2010]
- and many more!
Relevance LVQ variants
- local, rectangular, structured, restricted ... relevance matrices for visualization, functional data, texture recognition, etc.
- relevance learning in Robust Soft LVQ, Supervised NG, etc.
- combination of distances for mixed data ...

links
Matlab collection: Relevance and Matrix adaptation in Learning Vector Quantization (GRLVQ, GMLVQ and LiRaM LVQ)
http://matlabserver.cs.rug.nl/gmlvqweb/
Pre/re-prints etc.: http://www.cs.rug.nl/~biehl/
Challenging data sets? m.biehl@rug.nl

Questions?

uniqueness / regularization
quadratic distance measure (positive semi-definite pseudo-metric): $d_\Lambda(w, x) = (x - w)^\top \Lambda \, (x - w)$
intrinsic representation by linear transformation: $\Lambda = \Omega^\top \Omega$, $x \to \Omega x$
uniqueness (i): matrix square root $\Omega$ of $\Lambda$ is not unique*
canonical representation, e.g. the symmetric, positive semi-definite root $\Omega = \Lambda^{1/2}$
* irrelevant rotations, reflections, symmetries

uniqueness of relevance matrix
uniqueness (ii): given mapping $x \to \Omega x$
$\Omega' x = \Omega x$ for all examples is possible if $\Omega' = \Omega + \Delta$ exists with $\Delta x = 0$ for all examples,
i.e. the rows of $\Delta$ lie in the null-space of $C = \sum_\mu x^\mu (x^\mu)^\top$
→ identical mapping of all examples and prototypes, same distances and classification scheme w.r.t. training data
$C$ is singular if, e.g., features are highly correlated, interdependent

uniqueness of relevance matrix
a simple example: consider two identical, entirely irrelevant features, e.g. $x_1 = x_2$ in all examples
contributions cancel exactly if $\Omega_{i1} = -\Omega_{i2}$ for all $i$ (features disregarded in the classification)
but naïve interpretation of the diagonal elements $\Lambda_{11}$, $\Lambda_{22}$ suggests high relevance!
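The cancellation can be checked numerically. In this made-up example the first two features are identical in every vector and the corresponding columns of $\Omega$ have opposite signs, so only the third feature influences the distance, while the diagonal of $\Lambda = \Omega^\top \Omega$ rates all three features equally.

```python
# Hypothetical Omega: columns 1 and 2 cancel for vectors with x1 == x2
omega = [[1.0, -1.0, 0.0],
         [0.0, 0.0, 1.0]]


def mapped(x):
    """Linear mapping x -> Omega x."""
    return [sum(o * xn for o, xn in zip(row, x)) for row in omega]


def distance(w, x):
    """Squared Euclidean distance after the linear mapping."""
    return sum((a - b) ** 2 for a, b in zip(mapped(x), mapped(w)))


x = (0.7, 0.7, 2.0)     # first two features identical
w = (-1.3, -1.3, 0.5)   # prototype, same property

# only the third feature contributes: (2.0 - 0.5)^2
print(distance(w, x))

# diagonal of Lambda = Omega^T Omega
lambda_diag = [sum(row[j] ** 2 for row in omega) for j in range(3)]
print(lambda_diag)
```

The off-diagonal element $\Lambda_{12} = -1$ is what encodes the cancellation, which is why the diagonal alone can be misleading.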

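posterior null-space projection
training process yields $\Omega$
determine $C = \sum_\mu x^\mu (x^\mu)^\top$ with eigenvectors $u_i$ and eigenvalues $c_i$
column space projection: $\hat\Omega = \Omega \, \Pi$ with $\Pi = \sum_{i: c_i > 0} u_i u_i^\top$ removes null-space contributions
Note: $\hat\Omega$ minimizes $\| \Omega' \|_F$ under the condition $\Omega' x^\mu = \Omega x^\mu$ for all $\mu$
formal solution: $\hat\Omega = \Omega X X^{+}$ with $X = [x^1, \ldots, x^P]$ (Moore-Penrose pseudo-inverse $X^{+}$)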
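A sketch of the column space projection with NumPy, under the assumption that $C$ is the matrix $X X^\top$ built from the training vectors stored as columns of `X`. The function name, the tolerance used to decide which eigenvalues count as zero, and the example data are illustrative choices.

```python
import numpy as np


def column_space_projection(omega, X, tol=1e-10):
    """Return Omega_hat = Omega @ Pi, where Pi projects onto the span of
    the data vectors (columns of X, shape N x P)."""
    C = X @ X.T
    eigvals, eigvecs = np.linalg.eigh(C)      # C is symmetric
    keep = eigvals > tol * eigvals.max()      # non-zero eigenvalues only
    U = eigvecs[:, keep]
    return omega @ (U @ U.T)


# Made-up data: features 1 and 2 are identical in all four examples
X = np.array([[1.0, 2.0, 3.0, 0.0],
              [1.0, 2.0, 3.0, 0.0],
              [0.0, 1.0, 0.0, 2.0]])
omega = np.array([[2.0, -1.0, 0.0],
                  [0.0, 0.0, 1.0]])

omega_hat = column_space_projection(omega, X)
print(np.allclose(omega_hat @ X, omega @ X))  # same mapping of the data
print(np.linalg.norm(omega_hat) <= np.linalg.norm(omega))
```

In this example the projected matrix spreads the weight evenly over the two identical features instead of assigning them the arbitrary values 2 and -1.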

posterior regularization
training process yields $\Omega$
determine $C = \sum_\mu x^\mu (x^\mu)^\top$ with eigenvectors $u_i$ and eigenvalues $c_1 \ge c_2 \ge \ldots \ge c_N$
regularization: $\tilde\Omega = \Omega \, \Pi_K$ with $\Pi_K = \sum_{i=1}^{K} u_i u_i^\top$
- retains the eigenspace corresponding to largest eigenvalues only
- removes also eigenspace of (small) non-zero eigenvalues
- smoothens the mapping, less data set specific
- potentially improved generalization performance
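The regularization step can be sketched in the same way, again assuming $C = X X^\top$ with the training vectors as columns of `X`. The function name and the toy data are made up; `K` is the number of leading eigen-directions that are retained.

```python
import numpy as np


def regularize_mapping(omega, X, K):
    """Return Omega @ Pi_K, where Pi_K projects onto the K eigenvectors
    of C = X X^T with the largest eigenvalues."""
    C = X @ X.T
    eigvals, eigvecs = np.linalg.eigh(C)   # eigenvalues in ascending order
    U = eigvecs[:, ::-1][:, :K]            # K leading eigenvectors
    return omega @ (U @ U.T)


# Made-up data with strong, medium and very weak variation per feature
X = np.array([[3.0, -3.0, 0.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, -1.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 0.0, 0.1, -0.1]])
omega = np.array([[1.0, 1.0, 1.0]])

for K in (3, 2, 1):
    print(K, np.round(regularize_mapping(omega, X, K), 6))
```

Scanning `K` and monitoring the test performance is one way to study the dependence of generalization on the number of retained dimensions.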

posterior regularization
regularized mapping after/during training:
- retains original features
- flexible K
- can include prototypes
pre-processing of data (PCA-like):
- mapped feature space
- fixed K
- prototypes yet unknown
(*) remark: prototypes are (close to) linear combinations of feature vectors when converged
here: posterior regularization in classification schemes
- dependence of generalization performance on parameter K
- improved interpretability of the mapping / distance measure

illustrative example
infra-red spectral data: 124 wine samples, 256 wavelengths
- 30 training data
- 94 test spectra
GMLVQ classification, alcohol content (binned)
(here) high correlation of features (neighbor channels) and P=30 → effective dimension ≪ 256 can be expected

illustrative example
[figure annotations: over-fitting effect; best performance: 7 dimensions remaining; null-space correction: P=30 dimensions]
regularization (beyond column space projection)
- potentially enhances generalization, controls over-fitting

before and after ...
regularization
- enhances generalization
- smoothens relevance profile/matrix
- removes ‘false relevances’
- improves interpretability of Λ

Questions?