Interactive Visualization and Tuning of Multi-Dimensional Clusters for Indexing
Dasari Pavan Kumar (MS by Research Thesis)
IIIT Hyderabad, Centre for Visual Information Technology
Overview
• Provide a framework to generate better clusters for high-dimensional data points
• Provide a fast cluster analysis/generation tool
Data, Data!
• Digital data is being created at an unprecedented rate
• Data is collected to extract/search for “valuable” information
  – A difficult task, however!
• Data generated in the previous decade consisted mostly of textual information
  – Inverted indexes, suffix trees, N-grams, etc.
More data!
• Flickr, YouTube, etc. changed the game
  – Non-textual information (images)
  – Huge amounts of data!
• New methods! (Content-Based Image Retrieval)
  – Underlying processes remain similar
• Why image search?
  – Copyright infringement, offensive content, education, etc.
Multi-dimensional, multi-variate data
• Stock markets
• Weather/climate
• Business
Huge datasets – multiple dimensions. Finding “insights” can’t be fully automated.
Data Visualization
• Human intelligence/cognition is unmatched by computers
• Cluster analysis – descriptive modeling
• Information visualizations to support analysis
  – Identify important features/patterns
Past Attempts!
• XMDV tool (M. Ward)
  – Scatter-plot matrix
  – Parallel coordinate plot
• Cluster tree (Stuetzle)
• Cone trees (Robertson et al.)
What if you have millions of high-dimensional data points?
Indexing images/videos
• Extract feature vectors from images
• Apply clustering to compute a bag of words
• Generate feature histograms and apply ML methods (see the sketch below)
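A minimal sketch of this indexing pipeline, assuming OpenCV's SIFT and scikit-learn's MiniBatchKMeans as stand-ins for the GPU clustering described later; all helper names here are illustrative:

```python
# Illustrative bag-of-visual-words sketch; the thesis uses its own GPU k-means.
import cv2
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def extract_sift(image_paths):
    """Extract 128-D SIFT descriptors from each image file."""
    sift = cv2.SIFT_create()
    all_desc = []
    for path in image_paths:
        gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        if gray is None:
            continue
        _, desc = sift.detectAndCompute(gray, None)
        if desc is not None:
            all_desc.append(desc)
    return all_desc

def build_vocabulary(all_desc, n_words=1000):
    """Cluster the pooled descriptors into a visual vocabulary (bag of words)."""
    data = np.vstack(all_desc).astype(np.float32)
    return MiniBatchKMeans(n_clusters=n_words, random_state=0).fit(data)

def image_histogram(desc, vocab):
    """Quantize one image's descriptors into a normalized visual-word histogram."""
    words = vocab.predict(desc.astype(np.float32))
    hist, _ = np.histogram(words, bins=np.arange(vocab.n_clusters + 1))
    return hist / max(hist.sum(), 1)  # feed this histogram to an ML classifier
```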
Using SIFT features
• The fundamental problem – sheer volume of data
• No. of dimensions – 128
• No. of data points – in millions
• Other low-level image features exist
  – GLOH, steerable filters, spin images
Clusters + visualization
• The problem – choosing the right bag of words (clusters)
• Better visual words lead to better classification
Cluster analysis
• Provide a framework for the user to
  – Identify better subspaces
  – Efficiently/quickly compute clusters
  – Compare clustering schemas
Framework
Extracted low-level image descriptors → statistical sampling → manageable size (high-dimensional) → priority/weight assignment to features (automatic weight recommendations 1…N) → clustering (visual words) → visualization system → verification: if bad, user-defined weight re-assignment; if good, cluster the entire set → output schema
Tool
Why prioritize dimensions?
• Dimensionality reduction!
  – Feature transformation
  – Feature selection
Why not feature transformation?
• Dimensions can be redundant/irrelevant
  – Hence PCA cannot be trivially applied
• Clusters could be lost in a cloud of dimensions (curse of dimensionality)
• Difficult to interpret the combined dimensions
Feature selection
• Wrapper model
  – “Wraps” the selection process around the mining algorithm
  – The two go hand in hand, giving little control
• Filter model
  – Examines intrinsic properties of the data
“Interesting” dimensions
• Without any rank
  – Analyze the density distribution based on grids
  – Difficult to compare, since it is highly dependent on the density parameter
• Rank dimensions based on the distribution of the data (see the sketch below)
  – Uniformity (entropy)
  – No. of outliers: d > Q3 + 1.5·IQR or d < Q1 − 1.5·IQR
  – No. of unique values
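A sketch of how the per-dimension criteria above could be computed (histogram entropy, IQR outlier count, number of unique values); the exact scoring and combination used in the thesis are not reproduced here:

```python
import numpy as np

def dimension_scores(X, bins=32):
    """Per-dimension 'interestingness' cues for X (n_points x n_dims)."""
    scores = []
    for d in range(X.shape[1]):
        col = X[:, d]
        # Uniformity: entropy of the 1-D histogram of the dimension.
        hist, _ = np.histogram(col, bins=bins)
        p = hist / hist.sum()
        entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))
        # Outlier count by the IQR rule: d > Q3 + 1.5*IQR or d < Q1 - 1.5*IQR.
        q1, q3 = np.percentile(col, [25, 75])
        iqr = q3 - q1
        n_outliers = int(np.sum((col > q3 + 1.5 * iqr) | (col < q1 - 1.5 * iqr)))
        scores.append((d, entropy, n_outliers, np.unique(col).size))
    return scores  # combine/rank these cues to suggest per-dimension weights
```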
Ranked dimensions
• Assign weights based on the amount of “interestingness”
  – 1D histogram of the distribution
  – 2D correlations – PCP
• How do we assign weights?
  – Manually
  – Automatic suggestions!
Glyph view
• Standard SIFT glyph
• Bar chart (sketched below)
  – Length – rank
  – Color – weight
• Colormap
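A small matplotlib sketch of the bar-chart glyph idea (bar length encoding rank, bar color encoding weight via a colormap); matplotlib is an assumed stand-in for the tool's own renderer:

```python
import numpy as np
import matplotlib.pyplot as plt

def barchart_glyph(ranks, weights):
    """Bar chart over dimensions: bar length encodes rank, bar color encodes weight."""
    weights = np.asarray(weights, dtype=float)
    w01 = (weights - weights.min()) / (weights.max() - weights.min() + 1e-12)
    colors = plt.cm.viridis(w01)  # colormap for the weight channel
    plt.bar(np.arange(len(ranks)), ranks, color=colors)
    plt.xlabel("SIFT dimension")
    plt.ylabel("rank")
    plt.show()
```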
Data clustering
• Sample data set
  – 1.3 million points with 128 dimensions
• Clustering such data on a commodity PC
  – Almost impossible
Data clustering
• Plug in any clustering technique
  – Currently using k-means on the GPU (see the sketch below)
• Currently 200 iterations for 1.3 million SIFT vectors
  – 12 sec per iteration for 1000 clusters
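One simple way to plug user-assigned per-dimension weights into an off-the-shelf k-means is to pre-scale each column by the square root of its weight, so that plain Euclidean distance equals the weighted distance; a CPU sketch using scikit-learn's MiniBatchKMeans (the thesis' own GPU k-means is not shown):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def weighted_kmeans(X, weights, n_clusters=1000, n_iter=200):
    """k-means under a per-dimension weighted Euclidean distance.

    Scaling column d by sqrt(w_d) makes ||x - c||^2 in the scaled space equal
    to sum_d w_d * (x_d - c_d)^2 in the original space.
    """
    w = np.maximum(np.asarray(weights, dtype=float), 1e-12)
    Xw = X * np.sqrt(w)
    km = MiniBatchKMeans(n_clusters=n_clusters, max_iter=n_iter, random_state=0)
    labels = km.fit_predict(Xw)
    centers = km.cluster_centers_ / np.sqrt(w)  # map centers back to the original space
    return labels, centers
```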
Cluster Viz.
• Visualizing clusters over 128 dimensions
  – Not feasible
• Re-project into 2D space
  – Necessitates some sort of layout
• Plug in any graph-drawing method
  – Current: 2D force-based layout
Graph representation
• Compute a cluster tree from nearest-neighbor density
  – Similar nodes must be close
  – Can be estimated using an MST
• Generate the minimum spanning tree (MST) of the cluster centers
  – Single-linkage dendrogram
  – Prim’s method (sketched below)
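A sketch of Prim's method over the cluster centers; `prim_mst` is an illustrative helper, not the thesis code:

```python
import numpy as np

def prim_mst(centers):
    """Prim's algorithm: minimum spanning tree over the cluster centers.

    Returns edges (parent, child, distance), mirroring the single-linkage
    merge structure used to connect similar clusters.
    """
    centers = np.asarray(centers, dtype=float)
    n = len(centers)
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best_dist = np.linalg.norm(centers - centers[0], axis=1)  # distance to the tree so far
    best_from = np.zeros(n, dtype=int)
    edges = []
    for _ in range(n - 1):
        best_dist[in_tree] = np.inf          # never re-pick a node already in the tree
        j = int(np.argmin(best_dist))
        edges.append((int(best_from[j]), j, float(best_dist[j])))
        in_tree[j] = True
        dj = np.linalg.norm(centers - centers[j], axis=1)
        closer = dj < best_dist
        best_from = np.where(closer, j, best_from)
        best_dist = np.minimum(dj, best_dist)
    return edges
```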
Graph drawing
• Use a GPU implementation of a force-based graph layout (CPU sketch below)
  – Takes 0.2 sec for 1000 nodes
• Drill down into a “visual word” to see the actual SIFT interest points and understand the similarity
Figures: MST with layout; MST without layout
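A CPU sketch of a simple force-based layout over the MST edges (linear spring attraction along edges, inverse-distance repulsion between all node pairs); the constants and the `force_layout` name are illustrative, and the GPU implementation is not shown:

```python
import numpy as np

def force_layout(n_nodes, edges, n_iter=200, step=0.01):
    """2-D force-directed layout of a graph given as (i, j, dist) edges."""
    pos = np.random.rand(n_nodes, 2)
    for _ in range(n_iter):
        disp = np.zeros_like(pos)
        # Repulsion between every pair of nodes (inverse-distance).
        diff = pos[:, None, :] - pos[None, :, :]
        dist = np.linalg.norm(diff, axis=2) + 1e-9
        disp += 0.001 * (diff / dist[..., None] ** 2).sum(axis=1)
        # Attraction along graph edges (linear spring).
        for i, j, _ in edges:
            d = pos[j] - pos[i]
            disp[i] += 0.05 * d
            disp[j] -= 0.05 * d
        pos += step * disp
    return pos
```

Combined with the MST sketch above, a layout could be obtained with `pos = force_layout(len(centers), prim_mst(centers))`.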
Similar-looking regions clustered into the same ID
Cluster validation
• Two clustering schemas
  – Not feasible to compare visually
  – Not feasible to compare computationally
• Three basic strategies
  – Internal – compare schema C with the proximity matrix
  – External – build an independent partition according to our intuition, then compare it with schema C or the proximity matrix
  – Relative – choose the one that best fits!
Relative validity
• Some indices (Davies-Bouldin sketched below)
  – RS value
  – Davies-Bouldin index
  – SD index
• Around 1 minute per schema C on the CPU; the GPU implementation takes 1 second
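As one example of a relative index, a from-scratch Davies-Bouldin sketch (scikit-learn's `davies_bouldin_score` computes the same quantity on the CPU; the GPU implementation is not shown, and non-empty clusters are assumed):

```python
import numpy as np

def davies_bouldin(X, labels, centers):
    """Davies-Bouldin index: mean over clusters of the worst (S_i + S_j) / M_ij,
    where S_i is the average distance of cluster i's points to its centroid and
    M_ij is the centroid-to-centroid distance. Lower is better."""
    centers = np.asarray(centers, dtype=float)
    k = len(centers)
    scatter = np.array([
        np.linalg.norm(X[labels == i] - centers[i], axis=1).mean() for i in range(k)
    ])
    sep = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
    ratio = (scatter[:, None] + scatter[None, :]) / (sep + np.eye(k))  # eye avoids /0 on the diagonal
    np.fill_diagonal(ratio, -np.inf)  # ignore i == j when taking the max
    return float(ratio.max(axis=1).mean())
```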
Validity indices
• Indices plotted over a line graph
  – The min/max of the graph gives the optimal number of clusters Nc (sweep sketched below)
Plot: index value vs. iteration
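A sketch of that sweep, reusing the hypothetical `weighted_kmeans` and `davies_bouldin` helpers from the earlier sketches and taking the minimum of the index over candidate values of Nc:

```python
import numpy as np

def best_num_clusters(X, weights, candidates=(250, 500, 1000, 2000)):
    """Compute a validity index per candidate Nc and pick the best one
    (for Davies-Bouldin, the minimum of the curve)."""
    scores = []
    for nc in candidates:
        labels, centers = weighted_kmeans(X, weights, n_clusters=nc)
        scores.append(davies_bouldin(X, labels, centers))
    best = int(np.argmin(scores))
    return candidates[best], scores  # scores can be plotted as the line graph above
```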
Automatic weight recommendation
• Only a suggestive process
• Final decision left to the user
Results on UIUC image collection
• A total of 4485 images
• 15 categories
• Mean classification accuracy of 57.6% for SIFT with DoG
Interesting observation
• 135°, 215°, 270°
  – Lower weights assigned by the automatic schemas
• Same with corner cells
• Ds = {4, 12, 22, 43, 44, 55, 71, 78, 79, 83, 84, 110, 116}
Figure: 1D histograms corresponding to dimensions (a) 84, (b) 110, (c) 124
Results on UIUC image collection
• More clusters do not necessarily mean better classification
• Fei-Fei et al. report a mean accuracy of 52.5%
Summary
• Provide a framework for better cluster generation
• Provide a fast cluster analysis/generation tool for a commodity PC with a GPU
• Able to analyze distributions across dimensions
  – Identified redundant dimensions
• Able to achieve higher classification ratios with relative ease
Publications
• Interactive Visualization and Tuning of SIFT Indexing, Dasari Pavan Kumar and P. J. Narayanan, Vision, Modeling, and Visualization (VMV), 2010, Siegen, Germany
Limitations
• Limited by GPU and CPU memory
• The user needs to become familiar with the tool
• Visual decoding of the data is sometimes difficult
• Cluster generation still depends on parameters such as K (no. of clusters)
Future Work
• Provide a brush for the PCP view
• Incorporate support for subspace clustering
• Conduct experiments based on wrapper clustering methods
Thank you