
Visualizing High-Dimensional Data
Advanced Reading in Deep-Learning and Vision, Spring 2017
Gal Yona & Aviv Netanyahu

Where is 3 o’clock? Can you find a really happy smiley?

1053 emojis ordered by visual similarity – completely unsupervised, yet “labels” appear.
http://prostheticknowledge.tumblr.com/post/114329848096/1053-emojis-ordered-by-similarity-visualization-by

Topics • Motivation • SNE (Stochastic Neighbor Embedding) • Crowding Problem • t-SNE • O(n log n)-approximation • Applications • Multiple Maps t-SNE • Manifold Learning

Motivation
Two sides to data visualization:
• Data Exploration – making sure you understand your data
• Data Communication – making sure others understand your insights and/or can use your data easily

Motivation
• With high-dimensional data, both stages are potentially difficult.
• Standard visualization methods can usually capture only one or two variables at a time.

Problem Formulation
So, how can we visualize high-dimensional data? Learn an embedding that preserves as much of the significant structure of the high-dimensional data as possible in the low-dimensional map.

Formal Framework
Minimize an objective function that measures the discrepancy between similarities in the data and similarities in the map.

Why not just PCA?

PCA
• Recall the PCA objective: project the data onto a lower-dimensional subspace such that the variance is maximized.
• Why the bad results?
1. It is a linear projection.
2. It mostly preserves distances between dissimilar points. But is that really what we want for the purpose of visualization?
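For reference, the variance-maximization objective above can be written as follows (a standard textbook formulation, not taken from the slides):

$$\max_{\mathbf{w} : \|\mathbf{w}\| = 1} \; \mathbf{w}^{\top} \mathbf{S}\,\mathbf{w}, \qquad \mathbf{S} = \frac{1}{n} \sum_{i=1}^{n} (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^{\top}$$

where S is the sample covariance matrix and w the first principal direction; further directions are found under orthogonality constraints.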

In complex datasets, large distances are usually less indicative.

SNE (Stochastic Neighbor Embedding)
Hinton & Roweis, 2002, Advances in Neural Information Processing Systems
• In contrast to PCA, SNE focuses on maintaining the nearest neighbors in the lower-dimensional map.
• SNE converts the pairwise Euclidean distances between points into a probability distribution: high probability for nearby neighbors, low probability for distant points.

• p_{j|i} – similarity of data points in the high-dimensional space
• q_{j|i} – similarity of map points in the low-dimensional space
If we were able to perfectly preserve all the similarities in the data, p_{j|i} and q_{j|i} would be equivalent.
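The two similarities are defined as follows (the standard SNE formulation from Hinton & Roweis, reconstructed here because the slide’s equations were images):

$$p_{j|i} = \frac{\exp\!\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\!\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)}, \qquad q_{j|i} = \frac{\exp\!\left(-\|y_i - y_j\|^2\right)}{\sum_{k \neq i} \exp\!\left(-\|y_i - y_k\|^2\right)}$$

where the x_i are the data points, the y_i are the map points, and each σ_i is set per point through the perplexity parameter.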

SNE (Stochastic Neighbor Embedding)

SNE (Stochastic Neighbor Embedding)
The objective is a sum of weighted error terms, so different errors have a different “cost”:
• Close points mapped to far-apart points (p_{j|i} high, q_{j|i} low) → high penalty
• Far-apart points mapped to close points (p_{j|i} low, q_{j|i} high) → low penalty
SNE therefore focuses on preserving the local structure of the data.
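The cost function behind this behavior (the standard SNE objective, a sum of KL-divergences; the slide’s “weight” is p_{j|i} and its “error term” is the log-ratio):

$$C = \sum_i \mathrm{KL}(P_i \,\|\, Q_i) = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}$$

Because each term is weighted by p_{j|i}, mismatches on pairs that are close in the data dominate the cost.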

PCA on MNIST (0-9) vs. SNE on MNIST (0-5)

Optimization
• Optimization is performed using gradient descent.
• IMPORTANT: the KL-divergence objective is not convex – no guarantees!
• Techniques to avoid “bad” local minima:
• Large momentum (“keep going”)
• Simulated annealing (random noise)
Good visualization requires a good choice of parameters.
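A minimal sketch of this optimization loop, assuming a function `sne_grad` (hypothetical, not from the slides) that returns the gradient of the KL cost with respect to the map points:

```python
import numpy as np

def optimize_map(Y, sne_grad, n_iter=1000, lr=100.0,
                 momentum=0.8, noise_std=0.1, anneal_until=100):
    """Gradient descent with momentum and early random noise.

    Y        : (n, 2) array of initial map points (e.g. small Gaussian noise)
    sne_grad : callable returning dC/dY for the current map (assumed given)
    """
    velocity = np.zeros_like(Y)
    for t in range(n_iter):
        grad = sne_grad(Y)
        # Large momentum: keep moving in the accumulated direction ("keep going")
        velocity = momentum * velocity - lr * grad
        Y = Y + velocity
        # Simulated annealing: decaying random noise in the early iterations
        # helps escape bad local minima of the non-convex objective
        if t < anneal_until:
            Y = Y + np.random.normal(scale=noise_std * (1 - t / anneal_until),
                                     size=Y.shape)
    return Y
```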

Are we overfitting? “Classic” Machine Learning vs. Visualization
• Classic ML – goal: generalization. Given a training set, do well on a test set. Overfitting is undesirable.
• Visualization – goal: visualization. We just want to “do well” on our data (the “training set”). “Overfitting” is desirable.

Crowding Problem
The main problem with SNE: points tend to “crowd” together in the center of the map.

Reminder - Topics • Motivation • SNE (Stochastic Neighbor Embedding) • Crowding Problem • t-SNE • O(n log n)-approximation • Applications • Multiple Maps t-SNE • Manifold Learning

Mismatched tails can compensate for mismatched dimensionalities
https://github.com/oreillymedia/t-SNE-tutorial

t-SNE
van der Maaten & Hinton, 2008, Journal of Machine Learning Research
• Main difference: instead of the Gaussian-based similarities of SNE, t-SNE uses symmetrized joint probabilities p_{ij} in the data and a heavy-tailed Student-t distribution (one degree of freedom) for the similarities q_{ij} in the map.
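The replacement definitions (the standard t-SNE formulation from van der Maaten & Hinton, reconstructed here since the slide showed them as an image):

$$q_{ij} = \frac{\left(1 + \|y_i - y_j\|^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \|y_k - y_l\|^2\right)^{-1}}, \qquad p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}$$

The heavy tails of the Student-t let moderately dissimilar points sit far apart in the map, which is exactly what relieves the crowding problem.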

https://github.com/oreillymedia/t-SNE-tutorial

Entry (i, j) of the pairwise similarity matrix, from the first iteration to the last: the similarity matrix of the map points converges to the similarity matrix of the data points.
https://github.com/oreillymedia/t-SNE-tutorial

Topics • Motivation • SNE (Stochastic Neighbor Embedding) • Crowding Problem • t-SNE • O(n log n)-approximation • Applications • Multiple Maps t-SNE • Manifold Learning

t-SNE gradient interpretation
The gradient acts like a collection of springs between map points: each spring pulls or pushes a pair of points (exertion / compression) depending on whether the pair is too far apart or too close relative to its similarity in the data.
https://www.slideshare.net/ssuserb667a8/visualization-data-using-tsne
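The gradient in question, in its standard form (reconstructed from van der Maaten & Hinton; the factor (p_{ij} − q_{ij}) sets the spring’s strength and sign, and (y_i − y_j) its direction):

$$\frac{\partial C}{\partial y_i} = 4 \sum_{j \neq i} (p_{ij} - q_{ij}) \left(1 + \|y_i - y_j\|^2\right)^{-1} (y_i - y_j)$$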

t-SNE run time
Computing all pairwise forces is quadratic in the number of points, so exact t-SNE is limited to datasets with only a few thousand points!

Scalability – Barnes-Hut-SNE
van der Maaten, 2013, arXiv:1301.3342v2
• An O(n log n) implementation of t-SNE
• Reduces the number of pairwise forces that need to be computed
• Idea – the forces exerted by a group of points on a point that is relatively far away are all very similar, so they can be summarized by a single force (sketched below)
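A minimal sketch of that idea, under simplifying assumptions (a single precomputed group of points rather than the real quadtree; the repulsion follows the t-SNE kernel up to normalization):

```python
import numpy as np

THETA = 0.5  # Barnes-Hut trade-off: smaller = more accurate, larger = faster

def repulsion(y_i, y_j):
    """Unnormalized t-SNE repulsive force between two map points."""
    diff = y_i - y_j
    q = 1.0 / (1.0 + diff @ diff)  # Student-t kernel
    return q * q * diff

def group_force(y_i, group):
    """Repulsive force that a group (cell) of map points exerts on y_i.

    If the group is small relative to its distance from y_i, all of its
    members exert nearly the same force, so we replace them by a single
    interaction with the group's center of mass, counted once per member.
    """
    center = group.mean(axis=0)
    radius = np.linalg.norm(group - center, axis=1).max()
    dist = np.linalg.norm(y_i - center)
    if dist > 0 and radius / dist < THETA:
        return len(group) * repulsion(y_i, center)  # summarized interaction
    # Group is too close or too large: fall back to exact pairwise forces
    return sum(repulsion(y_i, y_j) for y_j in group)
```

In the actual algorithm the groups come from a quadtree over the map points, so far-away regions are summarized coarsely while nearby regions are handled exactly; this is what brings the cost down to O(n log n).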

Barnes-Hut Approximation
• Many of the pairwise interactions between points are very similar.

Barnes-Hut Approximation
• Approximate a group of similar interactions by a single interaction, counted once per point in the group (×3 in the illustration).

All 70,000 MNIST images
Improvement in run time – from days to ~10 minutes
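For a sense of scale, this is roughly how one could reproduce such a run with scikit-learn, whose TSNE uses the Barnes-Hut approximation by default (the library calls are standard scikit-learn, not from the slides):

```python
from sklearn.datasets import fetch_openml
from sklearn.manifold import TSNE

# All 70,000 MNIST digits as 784-dimensional pixel vectors
mnist = fetch_openml("mnist_784", as_frame=False)
X, y = mnist.data, mnist.target

# Barnes-Hut t-SNE ("barnes_hut" is the default method);
# `angle` is the theta parameter trading accuracy for speed
embedding = TSNE(n_components=2, perplexity=30,
                 method="barnes_hut", angle=0.5,
                 random_state=0).fit_transform(X)

print(embedding.shape)  # (70000, 2) – ready to scatter-plot, colored by y
```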

Topics • Motivation • SNE (Stochastic Neighbor Embedding) • Crowding Problem • t-SNE • O(n log n)-approximation • Applications • Multiple Maps t-SNE • Manifold Learning

ImageNet

ImageNet + AlexNet

Word2vec
• Input: a large corpus of text
• Embeds words into a high-dimensional space
• Words with common contexts in the corpus are close in the space
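A hedged sketch of the word2vec → t-SNE pipeline using gensim (gensim 4.x naming assumed; the toy `sentences` corpus stands in for a real one):

```python
from gensim.models import Word2Vec
from sklearn.manifold import TSNE

# Toy corpus: one list of tokens per sentence (stand-in for real text)
sentences = [["river", "bank", "water"], ["bank", "bailout", "money"]] * 100

# Embed words into a high-dimensional space (100-dim here)
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, seed=0)

# Project the word vectors to 2D with t-SNE for visualization
words = list(model.wv.index_to_key)
vectors = model.wv[words]
coords = TSNE(n_components=2, perplexity=3, random_state=0).fit_transform(vectors)

for word, (x, z) in zip(words, coords):
    print(f"{word}: ({x:.2f}, {z:.2f})")
```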

http://nlp.yvespeirsman.be/blog/visualizing-word-embeddings-with-tsne/

Limitations of visualizing words with t-SNE
A word with several senses gets only one point: “Bank” should be close to both “River” and “Bailout”, even though “River” and “Bailout” are unrelated – impossible in a single metric map.
http://homepage.tudelft.nl/19j49/multiplemaps/Multiple_maps_t-SNE.html

Topics • Motivation • SNE (Stochastic Neighbor Embedding) • Crowding Problem • t-SNE • O(n log n)-approximation • Applications • Multiple Maps t-SNE • Manifold Learning

Multiple maps t-SNE
van der Maaten & Hinton, 2011, https://link.springer.com/article/10.1007/s10994-011-5273-4
• Construct multiple maps, and give each object:
• A point in each map
• An importance weight for each point in each map
Example (importance weights):
Map 1 – River: 0, Bank: 1/2, Bailout: 1
Map 2 – River: 1, Bank: 1/2, Bailout: 0

Multiple maps t-SNE – formulation
• Define the map similarities as a weighted mixture over the maps (see below)
• Same cost function as before, now optimized w.r.t. N × M low-dim map points and N × M importance weights
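The lost “Define” equation is, in the paper’s form (reconstructed here; π_i^(m) denotes the importance weight of object i in map m, and y_i^(m) its location in that map):

$$q_{ij} \propto \sum_{m} \pi_i^{(m)} \pi_j^{(m)} \left(1 + \left\|y_i^{(m)} - y_j^{(m)}\right\|^2\right)^{-1}, \qquad \sum_{m} \pi_i^{(m)} = 1$$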

Word associations dataset

Visualization of genetic disease–phenotype similarities by multiple maps t-SNE with Laplacian regularization
• Diseases may be difficult to diagnose accurately due to specific combinations of confounding symptoms (different phenotypes)
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4243097/

• Phenotype feature vectors – every entry represents a MeSH (Medical Subject Headings) concept
• t-SNE input: phenotype similarity matrix
• In the visualization: text – phenotype ID, size – weight, color – disease category

Topics • Motivation • SNE (Stochastic Neighbor Embedding) • Crowding Problem • t-SNE • O(n log n)-approximation • Applications • Multiple Maps t-SNE • Manifold Learning

t-SNE as an instance of Manifold Learning
• Manifold: a space that locally resembles Euclidean space near each point
• Manifold learning: an approach to non-linear dimensionality reduction
• Algorithms for this task are based on the idea that the dimensionality of the dataset is only artificially high
• t-SNE assumes local linearity (Euclidean distances between neighbors), so t-SNE is an instance of manifold learning

When can t-SNE fail?
The local linearity assumption fails when:
• The data is noisy
• The data lies on a highly varying manifold

Data is noisy
Possible solution – run PCA first to suppress the noise, then t-SNE (sketched below).
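A minimal sketch of that recipe (the 30-dimensional intermediate target is the preprocessing used in the t-SNE paper; `X` stands in for your noisy high-dimensional data):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X = np.random.randn(1000, 784)  # stand-in for noisy high-dimensional data

# Step 1: PCA down to ~30 dimensions suppresses noise (and speeds up t-SNE)
X_reduced = PCA(n_components=30).fit_transform(X)

# Step 2: t-SNE on the denoised representation
embedding = TSNE(n_components=2, random_state=0).fit_transform(X_reduced)
print(embedding.shape)  # (1000, 2)
```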

Data is on a highly varying manifold
• The intrinsic dimension of a signal describes how many variables are needed to represent it
• Curse of intrinsic dimensionality – in datasets with high intrinsic dimensionality, the local linearity assumption on the manifold that t-SNE implicitly makes may be violated

Data is on a highly varying manifold
Possible solution – Autoencoders (sketched below):
• Input & output – high-dim points
• Reconstruct the output from a lower-dim “code” layer
• Objective – output a point close to the input point
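A hedged Keras sketch of such an autoencoder (layer sizes and the 2-unit code are illustrative choices, not from the slides); the narrow “code” layer is the low-dimensional representation one would visualize:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

input_dim = 784  # e.g. flattened MNIST images

# Encoder: compress the high-dim input down to a 2-dim "code"
inputs = keras.Input(shape=(input_dim,))
h = layers.Dense(128, activation="relu")(inputs)
code = layers.Dense(2, name="code")(h)

# Decoder: reconstruct the input from the code
h = layers.Dense(128, activation="relu")(code)
outputs = layers.Dense(input_dim, activation="sigmoid")(h)

autoencoder = keras.Model(inputs, outputs)
# Objective: make the output close to the input point
autoencoder.compile(optimizer="adam", loss="mse")

X = np.random.rand(1000, input_dim)  # stand-in data scaled to [0, 1]
autoencoder.fit(X, X, epochs=5, batch_size=64, verbose=0)

# The encoder alone maps points to the 2-dim code for visualization
encoder = keras.Model(inputs, code)
codes = encoder.predict(X, verbose=0)
print(codes.shape)  # (1000, 2)
```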

Conclusions
• t-SNE is a state-of-the-art method for visualizing high-dimensional datasets in 2D or 3D
• Scalable to large datasets
• Has been successfully applied in a wide range of tasks

Questions?