Matrix Sketching over Sliding Windows
Zhewei Wei¹, Xuancheng Liu¹, Feifei Li², Shuo Shang¹, Xiaoyong Du¹, Ji-Rong Wen¹
¹ School of Information, Renmin University of China
² School of Computing, The University of Utah

Matrix data
Data               Rows            Columns          d              n
Textual            Documents       Words            10^5 – 10^7    >10^10
Actions            Users           Types            10^1 – 10^4    >10^7
Visual             Images          Pixels, SIFT     10^5 – 10^6    >10^8
Audio              Songs, tracks   Frequencies      10^5 – 10^6    >10^8
Machine Learning   Examples        Features         10^2 – 10^4    >10^6
Financial          Prices          Items, Stocks    10^3 – 10^5    >10^6

Singular Value Decomposition (SVD)
[Figure: matrix diagram of the decomposition]
  • Principal component analysis (PCA)
  • k-means clustering
  • Latent semantic indexing (LSI)

SVD & Eigenvalue decomposition
[Figure: matrix diagrams of the two decompositions]
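The link between the two decompositions named on this slide is a standard identity, written out here as a reminder. The symbols A (an n × d matrix), U, Σ, and V are notation assumed for this note, with the thin SVD so that Σ is a d × d diagonal matrix:

```latex
A = U \Sigma V^{\top}
\quad\Longrightarrow\quad
A^{\top} A = V \Sigma\, U^{\top} U\, \Sigma V^{\top} = V \Sigma^{2} V^{\top}
```

That is, the right singular vectors of A are eigenvectors of A^T A, and the eigenvalues of A^T A are the squared singular values of A.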

Matrix Sketching

Matrix Sketching over Sliding Windows

Motivation 1: Sliding windows vs. unbounded streams
  • The sliding window model is a more appropriate model in many real-world applications.
  • This is particularly so in the areas of data analysis where matrix sketching techniques are widely used.
  • Applications:
    § Analyzing tweets for the past 24 hours.
    § Sliding window PCA for detecting changes and anomalies [Papadimitriou 2006, Qahtan 2015].

Motivation 2: Lower bound

Three algorithms
Sketches   Update   Space   Window            Interpretable?
Sampling   Slow     Large   Sequence & time   Yes
LM-FD      Fast     Small   Sequence & time   No
DI-FD      Slow             Sequence          No

Experiments: space vs. error

Experiments: time vs. space

Conclusions
  • First attempt to tackle the sliding window matrix sketching problem.
  • Lower bounds show that the sliding window model differs from the unbounded streaming model for the matrix sketching problem.
  • We propose algorithms for both time-based and sequence-based windows, with theoretical guarantees and experimental evaluation.

Thanks!

Experiments
  • LM-FD provides better space-error tradeoffs than the sampling algorithms.
  • DI-FD vs. LM-FD: depends on the ratio R.
  • SWOR vs. SWR (sampling without vs. with replacement): depends on the data set.

Experiments
  • Run algorithms on real-world matrices.
  • Measure actual covariance error, space, and update time.
  • Datasets for sequence-based windows:
    § SYNTHETIC: random noisy matrix, used by [Liberty 2013]
    § BIBD: incidence matrix of a Balanced Incomplete Block Design, from Mark Giesbrecht, University of Waterloo
    § PAMAP: physical activity monitoring data set
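To make the "covariance error" measurement concrete, here is a minimal sketch in plain Python. It assumes the definition commonly used in the matrix sketching literature, ||A^T A − B^T B||_2 / ||A||_F^2, where A holds the rows currently in the window and B is the sketch; the slides do not spell the formula out, so treat that definition, and the helper names `gram`, `spectral_norm_sym`, and `covariance_error`, as assumptions made for illustration. The spectral norm is approximated by power iteration, which is adequate for small examples but is not a tuned implementation.

```python
import math


def gram(rows, d):
    """Return the d x d matrix A^T A for a list of length-d rows."""
    g = [[0.0] * d for _ in range(d)]
    for a in rows:
        for i in range(d):
            for j in range(d):
                g[i][j] += a[i] * a[j]
    return g


def spectral_norm_sym(m, iters=200):
    """Approximate the largest absolute eigenvalue of a symmetric matrix.

    Plain power iteration from a fixed start vector (an assumption of this
    sketch); it can underestimate if that vector happens to be orthogonal
    to the dominant eigenspace.
    """
    d = len(m)
    v = [1.0 / (i + 1) for i in range(d)]
    norm = 0.0
    for _ in range(iters):
        w = [sum(m[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in w]
    return norm


def covariance_error(window_rows, sketch_rows, d):
    """Assumed definition: ||A^T A - B^T B||_2 / ||A||_F^2."""
    ga = gram(window_rows, d)
    gb = gram(sketch_rows, d)
    diff = [[ga[i][j] - gb[i][j] for j in range(d)] for i in range(d)]
    frob_sq = sum(x * x for a in window_rows for x in a)
    return spectral_norm_sym(diff) / frob_sq
```

For example, with window rows `[[3.0, 0.0], [0.0, 4.0]]`, a sketch that keeps only the second row loses the direction with squared norm 9 out of a total of 25, while a sketch equal to the window itself has zero error.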

Sampling-based algorithms

Maintaining top-1 priority over sliding window
[Figure: priority plotted against time, annotated "Probability in the skyline"]
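One standard way to keep the top-priority item of a sequence-based window is to store only the "skyline": items that have no later arrival with an equal or higher priority. The sketch below illustrates that idea with a deque. The class name `TopPriorityWindow` and its methods are hypothetical, and priorities are passed in directly; how the priorities are drawn (for instance from row norms) is outside this sketch and is not taken from the slides.

```python
from collections import deque


class TopPriorityWindow:
    """Track the highest-priority item among the last `window` arrivals.

    Hypothetical helper for illustration only.
    """

    def __init__(self, window):
        self.window = window
        # Entries are (arrival index, priority, item); priorities are
        # strictly decreasing from the left end to the right end.
        self.skyline = deque()
        self.count = 0

    def push(self, priority, item):
        # An older item with priority <= the new one can never be the
        # window maximum again, so it is dropped.
        while self.skyline and self.skyline[-1][1] <= priority:
            self.skyline.pop()
        self.skyline.append((self.count, priority, item))
        self.count += 1
        # Expire entries that have slid out of the window.
        while self.skyline[0][0] < self.count - self.window:
            self.skyline.popleft()

    def top(self):
        """Return the item with the highest priority in the window."""
        return self.skyline[0][2] if self.skyline else None
```

Each arrival is appended once and removed at most once, so the amortized cost per update is constant; the stored skyline is typically much shorter than the window when priorities are random.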

Logarithmic Method: LM-FD algorithm

Dyadic Interval: DI-FD algorithm

Low rank approximation
[Figure: matrix diagram of the truncated decomposition]
  • Principal component analysis (PCA)
  • k-means clustering
  • Latent semantic indexing (LSI)
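For reference, the best rank-k approximation obtained from the SVD can be written as follows. The notation (singular values σ_i in decreasing order, singular vectors u_i and v_i, target rank k) is assumed for this note rather than taken from the slides; the error identities are the standard Eckart–Young result.

```latex
A_k = U_k \Sigma_k V_k^{\top} = \sum_{i=1}^{k} \sigma_i\, u_i v_i^{\top},
\qquad
\|A - A_k\|_2 = \sigma_{k+1},
\qquad
\|A - A_k\|_F^2 = \sum_{i > k} \sigma_i^2
```

Keeping only the top k singular triplets is what makes the listed applications practical on large matrices.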