Streaming Manifold Learning
Shinjae Yoo, Computational Science Initiative

Streaming Unsupervised Feature Selection

Scientific Achievement
A streaming approximation (limited memory, one pass) of an expensive manifold feature selection algorithm.

Significance and Impact
Our proposed FSDS (Feature Selection on Data Stream) enables analysis of high-velocity big data streams. Because it is an unsupervised learning algorithm, it can be used with any machine learning algorithm.

Research Details
• Matrix-sketch-based streaming decomposition with a theoretical error-bound analysis
• Instead of expensive Lasso regression, adopts cheaper ridge regression on an orthonormal space
• Dynamic feature subset selection as the topic or distribution changes
• Compared with the state-of-the-art batch unsupervised feature selection method (MCFS), FSDS showed similar or better accuracy on the 20 Newsgroups data (top) and much better scalability (bottom)

Huang, Hao, Shinjae Yoo, and Shiva Prasad Kasiviswanathan. "Unsupervised feature selection on data streams." CIKM 2014.
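
A minimal sketch of why ridge regression is cheap on an orthonormal space, as the Research Details note. It assumes a basis U with orthonormal columns (for example, singular vectors from a sketch decomposition), in which case the ridge solution (U^T U + lam I)^-1 U^T y collapses to U^T y / (1 + lam) and needs no matrix inversion. The function name, the list-of-rows layout, and the parameter `lam` are illustrative assumptions, not the FSDS implementation.

```python
def ridge_on_orthonormal_basis(U, y, lam):
    """Ridge weights for min ||U w - y||^2 + lam * ||w||^2.

    Assumes the columns of U (given as a list of rows) are orthonormal,
    so U^T U = I and the solution is simply U^T y / (1 + lam).
    """
    n_rows, n_cols = len(U), len(U[0])
    return [
        sum(U[i][j] * y[i] for i in range(n_rows)) / (1.0 + lam)
        for j in range(n_cols)
    ]


# Two orthonormal columns in R^3 (hypothetical toy data).
U = [[1.0, 0.0],
     [0.0, 1.0],
     [0.0, 0.0]]
y = [2.0, 4.0, 7.0]
weights = ridge_on_orthonormal_basis(U, y, lam=1.0)
```

Each weight costs one dot product, so the per-update cost grows linearly with the number of basis vectors rather than requiring an iterative Lasso solver.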

Streaming Spectral Clustering (SSC)

Scientific Achievement
A streaming approximation (limited memory, one pass) of the expensive spectral clustering algorithm.

Significance and Impact
Our proposed SSC enables big-data spectral clustering on a laptop, without a large cluster, so anyone can do high-quality clustering analysis.

Research Details
• Streaming approximations of graph Laplacian matrix construction, including the degree matrix and symmetric normalization
• Extended matrix sketch algorithms that process larger stream batches for a better streaming decomposition approximation
• Adapts to concept drift or distribution changes using the proposed manifold embedding rotation matrix
• Compared with state-of-the-art streaming clustering algorithms, SSC showed much better accuracy (top) and comparable scalability (bottom)

Yoo, Shinjae, Hao Huang, and Shiva Prasad Kasiviswanathan. "Streaming spectral clustering." 2016 IEEE 32nd International Conference on Data Engineering (ICDE). IEEE, 2016.
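
For reference, a batch sketch of the quantities that the graph Laplacian bullet approximates: the degree matrix D and the symmetrically normalized Laplacian L = I - D^(-1/2) W D^(-1/2) of an affinity matrix W. This is the standard batch definition, not the streaming approximation from the paper; the function name and the toy matrix are illustrative assumptions.

```python
import math


def normalized_laplacian(W):
    """Return (degrees, L_sym) for a symmetric affinity matrix W.

    L_sym = I - D^(-1/2) W D^(-1/2), where D is the diagonal degree matrix.
    Assumes every node has positive degree.
    """
    n = len(W)
    degrees = [sum(row) for row in W]
    inv_sqrt = [1.0 / math.sqrt(d) for d in degrees]
    L = [
        [(1.0 if i == j else 0.0) - inv_sqrt[i] * W[i][j] * inv_sqrt[j]
         for j in range(n)]
        for i in range(n)
    ]
    return degrees, L


# Path graph on three nodes (hypothetical toy affinity matrix).
W = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 1.0],
     [0.0, 1.0, 0.0]]
degrees, L_sym = normalized_laplacian(W)
```

The batch version needs all of W in memory, which is the cost a one-pass, limited-memory approximation is meant to avoid.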

Streaming Multidimensional Scaling (MDS)

Scientific Achievement
A streaming approximation (limited memory, one pass) of the expensive classical MDS manifold visualization algorithm.

Significance and Impact
Our proposed scMDS enables big-data visual analytics on a laptop, without a large cluster. Anyone can do visual analytics without expensive …

Research Details
• Streaming approximations of bi-centering matrix construction, including feature mean and normalization, with theoretical guarantees
• Efficient realignment of embedding coordinates under concept drift or distribution changes
• Adopts a matrix-sketch-based decomposition approximation
• Visualization of three selected topics after processing the entire 20 Newsgroups data, which requires significant memory. The top shows the results generated by classical MDS (batch version) and the bottom shows the results generated by our proposed scMDS (streaming version).

Xi Zhang, et al. "Streaming Multidimensional Scaling", under review.
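
As a point of comparison, the batch bi-centering (double-centering) step of classical MDS can be sketched as below. It turns a matrix of squared pairwise distances into the Gram matrix B = -1/2 J D2 J, where J = I - (1/n) 1 1^T. This is the textbook batch step, not the streaming approximation described above; the function name and toy data are illustrative assumptions.

```python
def double_center(D2):
    """Classical-MDS Gram matrix B = -1/2 * J * D2 * J.

    D2 is an n x n matrix of squared pairwise distances. Written out,
    the centering subtracts row and column means and adds back the
    grand mean.
    """
    n = len(D2)
    row_mean = [sum(row) / n for row in D2]
    col_mean = [sum(D2[i][j] for i in range(n)) / n for j in range(n)]
    grand_mean = sum(row_mean) / n
    return [
        [-0.5 * (D2[i][j] - row_mean[i] - col_mean[j] + grand_mean)
         for j in range(n)]
        for i in range(n)
    ]


# Three points on a line at x = 0, 2, 4 (hypothetical toy data).
xs = [0.0, 2.0, 4.0]
D2 = [[(a - b) ** 2 for b in xs] for a in xs]
B = double_center(D2)
```

The top eigenvectors of B, scaled by the square roots of their eigenvalues, give the classical MDS coordinates. Here B equals the outer product of the centered coordinates (-2, 0, 2).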

SCMDS - ALGORITHM

Performance Anomaly Detection

Scientific Achievement
Detects interesting performance anomalies for an individual job or a whole workflow.

Significance and Impact
There is too much performance trace information to capture, but our proposed performance anomaly detection points out what to look at.

Research Details
• Collect and preprocess large-scale HPC performance traces. The current model analyzes only function-call execution-time anomalies
• Apply the popular LOF and iForest anomaly detection algorithms
• Integrated with performance visualization for easy navigation of performance anomalies

The X axis is normalized execution time (0~1) and the Y axis is execution time. Gray dots are classified as normal and colored dots are classified as abnormal. The number of anomalies and the degree of local anomaly are hyperparameters of the algorithms to tune.
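
To make "degree of local anomaly" concrete, here is a simplified sketch of the Local Outlier Factor (LOF) score. It assumes a small in-memory list of points, uses exactly k neighbors (ties at the k-th distance are not expanded), and assumes no duplicate points. The slides do not say which implementation was used; the function name, `k`, and the toy data are illustrative assumptions.

```python
import math


def lof_scores(points, k):
    """Simplified Local Outlier Factor for each point (higher = more anomalous)."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # k nearest neighbors of each point (excluding itself) and its k-distance.
    neighbors, k_dist = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbors.append(order[:k])
        k_dist.append(dist[i][order[k - 1]])

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], dist[i][j]) for j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: average neighbor density divided by the point's own density.
    return [sum(lrd[j] for j in neighbors[i]) / (k * lrd[i]) for i in range(n)]


# Four clustered points plus one far-away point (hypothetical toy data).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
scores = lof_scores(points, k=2)
```

Points inside the cluster score close to 1, while the isolated point scores far above 1. A threshold on this score, or a fixed number of top-scoring points, then decides what gets flagged as abnormal.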