Non-linear Principal Manifolds: a Useful Tool in Bioinformatics and Medical Applications. Andrei Zinovyev, Institut des Hautes Études Scientifiques, France
Plan of the talk: Object of study; Definition of principal manifold (PM); Constructing PMs: elastic maps; Examples of biomedical applications
Principal manifolds: elastic maps framework (diagram of non-linear data-mining methods). Methods shown: LLE, ISOMAP, multidimensional scaling, PCA, K-means, principal manifolds, SOM, SVM. Tasks shown: visualization, supervised classification, clustering, regression and approximation, factor analysis
Finite set of objects in R^N: the IRIS database. Each object X_i, i = 1..m, is one row; the last column is SPECIES. Column labels on the slide: Petal height, Petal width, Sepal height.
4.9  3    1.4  0.2  Iris-setosa
4.7  3.2  1.3  0.3  Iris-setosa
4.6  3.1  1.5  0.2  Iris-setosa
7    3.2  4.7  1.4  Iris-versicolor
6.4  3.2  4.5  1.5  Iris-versicolor
6.9  3.1  4.9  1.5  Iris-versicolor
6.3  3.3  6    2.5  Iris-virginica
5.8  2.7  …    1.9  Iris-virginica
7.1  3    5.9  2.1  Iris-virginica
6.3  2.9  5.6  1.8  Iris-virginica
Mean point K-means clustering
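As a rough illustration of the two ideas on this slide, the sketch below computes the mean point of a data set and runs a plain Lloyd-style K-means iteration. The function names (`mean_point`, `kmeans`), the toy data and the deterministic initialization from the first k points are assumptions made for illustration; none of them comes from the talk.

```python
def mean_point(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / n for d in range(dim))


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=100):
    """Plain Lloyd iteration; centers start at the first k points (assumption)."""
    centers = [tuple(p) for p in points[:k]]
    clusters = [[] for _ in range(k)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its closest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            clusters[j].append(p)
        # Update step: each center moves to the mean point of its cluster.
        new_centers = [
            mean_point(c) if c else centers[j] for j, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # assignments can no longer change
        centers = new_centers
    return centers, clusters


# Hypothetical toy data: two well-separated groups in the plane.
data = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0),
        (1.0, 0.0), (10.0, 11.0), (11.0, 10.0)]
overall_mean = mean_point(data)
centers, clusters = kmeans(data, k=2)
```

With k = 1 the single center is just the mean point, which is the sense in which K-means generalizes the mean as an approximating "object" for the data.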
Principal “Object”
Principal Component Analysis: maximal dispersion; 1st principal axis, 2nd principal axis
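The first principal axis can be read as the direction through the mean point along which the data have maximal dispersion. A minimal sketch of that idea follows, using power iteration on the sample covariance matrix. The function name `first_principal_axis`, the starting vector and the toy data are assumptions for illustration, and power iteration is only one of several ways to find this axis.

```python
import math


def first_principal_axis(points, n_iter=200):
    """Return (mean, unit direction, variance along that direction)."""
    n, dim = len(points), len(points[0])
    mean = [sum(p[d] for p in points) / n for d in range(dim)]
    centered = [[p[d] - mean[d] for d in range(dim)] for p in points]
    # Sample covariance matrix (dim x dim).
    cov = [
        [sum(row[a] * row[b] for row in centered) / (n - 1) for b in range(dim)]
        for a in range(dim)
    ]
    # Power iteration; the all-ones start is an assumption and would fail
    # if it were exactly orthogonal to the leading axis or if cov were zero.
    v = [1.0] * dim
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(dim)) for a in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Dispersion (variance) of the data along the found direction.
    cov_v = [sum(cov[a][b] * v[b] for b in range(dim)) for a in range(dim)]
    variance = sum(v[a] * cov_v[a] for a in range(dim))
    return mean, v, variance


# Hypothetical toy data lying exactly on the line y = 2x.
data = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
mean, axis, variance = first_principal_axis(data)
```

For these points the recovered axis is parallel to (1, 2), as expected for data on the line y = 2x. The second principal axis would be the direction of maximal dispersion orthogonal to the first; it is not computed here.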
Principal manifold
What do we want? Non-linear surface (1D, 2D, 3D …); smooth and not twisted; the data model is unknown; speed (time linear with Nm); uniqueness; fast way to project datapoints
Metaphor of elasticity (figure labels: data points, graph nodes; energy terms U(Y), U(E), U(R))
Constructing elastic nets (figure labels: node y; edge ends E(0), E(1); rib nodes R(0), R(2))
Definition of elastic energy (figure labels: data point Xj; node y; edge ends E(0), E(1); rib nodes R(1), R(0), R(2))
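The formulas for this slide did not survive extraction. The sketch below assumes the form usually given in the elastic map literature: an approximation term U(Y) (mean squared distance from each data point to its closest node), a stretching term U(E) (squared edge lengths E(1) - E(0), weighted by λ) and a bending term U(R) (squared deviation R(1) + R(2) - 2R(0) of each rib from a straight line, weighted by μ). The names `elastic_energy`, `lam` and `mu`, the uniform weights and the toy net are assumptions, not taken from the talk.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def elastic_energy(data, nodes, edges, ribs, lam, mu):
    """Return the three assumed energy terms (U_Y, U_E, U_R).

    edges: list of (i, j) node index pairs, i.e. E(0), E(1).
    ribs:  list of (center, left, right) index triples, i.e. R(0), R(1), R(2).
    """
    # U(Y): each data point is attached to its closest node.
    u_y = sum(min(sq_dist(x, y) for y in nodes) for x in data) / len(data)
    # U(E): penalizes stretching of every edge.
    u_e = lam * sum(sq_dist(nodes[i], nodes[j]) for i, j in edges)
    # U(R): penalizes bending of every rib.
    u_r = 0.0
    for c, l, r in ribs:
        bend = [
            nodes[l][d] + nodes[r][d] - 2.0 * nodes[c][d]
            for d in range(len(nodes[c]))
        ]
        u_r += mu * sum(b * b for b in bend)
    return u_y, u_e, u_r


# Hypothetical toy example: a 3-node chain, straight versus bent.
data = [(0.0, 0.5), (1.0, -0.5), (2.0, 0.5)]
edges = [(0, 1), (1, 2)]
ribs = [(1, 0, 2)]
straight = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
bent = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]

u_straight = elastic_energy(data, straight, edges, ribs, lam=0.1, mu=1.0)
u_bent = elastic_energy(data, bent, edges, ribs, lam=0.1, mu=1.0)
```

In this toy case the straight chain has zero bending energy, while the bent chain pays for both longer edges and a non-zero rib term. The total energy to be minimized would be the sum of the three terms.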
Elastic manifold
Global minimum and softening (figure panels labeled 10^3, 10^2, 10^1, 10^-1)
Adaptive algorithms Refining net: Growing net Idea of scaling: Adaptive net
Projection onto the manifold: closest node of the net; closest point of the manifold
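The two projection rules named on this slide can be compared on a one-dimensional net. The sketch below assumes the manifold is the piecewise-linear curve through consecutive nodes, which is a simplification: a 2D map would also need projection onto faces. The function names and the toy net are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def closest_node(x, nodes):
    """Index of the net node closest to x (piecewise-constant projection)."""
    return min(range(len(nodes)), key=lambda j: sq_dist(x, nodes[j]))


def project_on_segment(x, a, b):
    """Orthogonal projection of x onto segment [a, b], clamped to its ends."""
    ab = [q - p for p, q in zip(a, b)]
    ax = [q - p for p, q in zip(a, x)]
    denom = sum(c * c for c in ab)
    t = 0.0 if denom == 0.0 else sum(u * v for u, v in zip(ax, ab)) / denom
    t = max(0.0, min(1.0, t))
    return tuple(p + t * c for p, c in zip(a, ab))


def closest_point_on_polyline(x, nodes):
    """Closest point of the piecewise-linear curve through the nodes."""
    candidates = [
        project_on_segment(x, nodes[i], nodes[i + 1])
        for i in range(len(nodes) - 1)
    ]
    return min(candidates, key=lambda p: sq_dist(x, p))


# Hypothetical toy net: three nodes on a horizontal line.
nodes = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
x = (0.4, 1.0)
node_index = closest_node(x, nodes)
point = closest_point_on_polyline(x, nodes)
```

The closest-node rule snaps the data point to node 0, while the closest-point rule places it between nodes 0 and 1, which gives a continuous rather than a quantized coordinate on the manifold.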
Colorings: visualize any function
Density visualization
Example: different topologies (R^N, R^2)
VIDAExpert tool and elmap C++ package
Regression and principal manifolds: principal component regression (axes: x, F(x))
Image skeletonization or clustering around curves
Approximation of molecular surfaces
Application: economic data (map colorings: density, gross output, profit, growth rate)
Medical table: 1700 patients with myocardial infarction. Patients map: density, lethal cases
Medical table: 1700 patients with myocardial infarction, 128 indicators. Indicators shown: age; number of infarctions in anamnesis; stenocardia functional class
Codon usage in all genes of one genome: Escherichia coli, Bacillus subtilis. Groups shown: majority of genes, “foreign” genes, highly expressed genes, “hydrophobic” genes
Golub’s leukemia dataset: 3051 genes, 38 samples (ALL/B-cell, ALL/T-cell, AML). Map of genes (figure labels: vote for ALL, vote for AML; ALL sample, AML sample; used by T. Golub, used by W. Lie)
Golub’s leukemia dataset, map of samples: classes AML, ALL/B-cell, ALL/T-cell; colorings: density, Cystatin C, CA2 (carbonic anhydrase II), Retinoblastoma binding protein P48, X-linked helicase II
Thank you for your attention! Questions?