Matrix Sketching over Sliding Windows Zhewei Wei 1
- Slides: 21
Matrix Sketching over Sliding Windows
Zhewei Wei¹, Xuancheng Liu¹, Feifei Li², Shuo Shang¹, Xiaoyong Du¹, Ji-Rong Wen¹
¹ School of Information, Renmin University of China
² School of Computing, The University of Utah
Matrix data

Data               Rows            Columns          d              n
Textual            Documents       Words            10^5 – 10^7    > 10^10
Actions            Users           Types            10^1 – 10^4    > 10^7
Visual             Images          Pixels, SIFT     10^5 – 10^6    > 10^8
Audio              Songs, tracks   Frequencies      10^5 – 10^6    > 10^8
Machine Learning   Examples        Features         10^2 – 10^4    > 10^6
Financial          Prices          Items, Stocks    10^3 – 10^5    > 10^6
Singular Value Decomposition (SVD) • Principal component analysis (PCA) • K-means clustering • Latent semantic indexing (LSI)
SVD & Eigenvalue decomposition
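For reference, the standard relation between the two decompositions can be written as below. The notation is an assumption, not taken from the slide: $A \in \mathbb{R}^{n \times d}$ with thin SVD $A = U \Sigma V^{\top}$, where $U^{\top} U = I$ and $\Sigma$ is a square diagonal matrix.

```latex
A = U \Sigma V^{\top}
\quad\Longrightarrow\quad
A^{\top} A = V \Sigma U^{\top} U \Sigma V^{\top} = V \Sigma^{2} V^{\top}
```

In words: the right singular vectors of $A$ are eigenvectors of $A^{\top} A$, and the squared singular values are the corresponding eigenvalues.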
Matrix Sketching •
Matrix Sketching over Sliding Windows •
Motivation 1: Sliding windows vs. unbounded streams • Sliding window model is a more appropriate model in many real-world applications. • Particularly so in the areas of data analysis wherein matrix sketching techniques are widely used. • Applications: § Analyzing tweets for the past 24 hours. § Sliding window PCA for detecting changes and anomalies [Papadimitriou 2006, Qahtan 2015].
Motivation 2: Lower bound •
Three algorithms

Sketches   Update   Space       Window            Interpretable?
Sampling   Slow     Large       Sequence & time   Yes
LM-FD      Fast     Small       Sequence & time   No
DI-FD      Slow     [missing]   Sequence          No
Experiments: space vs. error
Experiments: time vs. space
Conclusions • First attempt to tackle the sliding window matrix sketching problem. • Lower bounds show that the sliding window model differs from the unbounded streaming model for the matrix sketching problem. • Proposed algorithms for both time-based and sequence-based windows, with theoretical guarantees and experimental evaluation.
Thanks!
Experiments • LM-FD provides better space-error tradeoffs than the sampling algorithms. • DI-FD vs. LM-FD: depends on the ratio R. • SWOR vs. SWR: depends on the data set.
Experiments • Run algorithms on real-world matrices. • Measure actual covariance error, space and update time. • Datasets for sequence-based windows: § SYNTHETIC: random noisy matrix, used by [Liberty 2013] § BIBD: incidence matrix of a Balanced Incomplete Block Design from Mark Giesbrecht, University of Waterloo § PAMAP: physical activity monitoring data set
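The slide does not spell out how covariance error is computed. A minimal sketch, assuming the definition usual in the Frequent Directions literature, ||AᵀA − BᵀB||₂ / ||A||_F², where A holds the rows in the current window and B is the sketch. The function names and the power-iteration routine are illustrative choices, not the authors' code.

```python
import math
import random


def gram(rows):
    """Return M^T M (d x d) for a matrix M given as a list of length-d rows."""
    d = len(rows[0])
    g = [[0.0] * d for _ in range(d)]
    for r in rows:
        for i in range(d):
            for j in range(d):
                g[i][j] += r[i] * r[j]
    return g


def matvec(m, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def spectral_norm_symmetric(s, iters=200, seed=0):
    """Estimate ||S||_2 of a symmetric matrix by power iteration on S*S."""
    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in s]
    for _ in range(iters):
        w = matvec(s, matvec(s, v))
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            # S*S*v vanished; for a random start this essentially means S = 0.
            return 0.0
        v = [x / norm for x in w]
    # For a unit vector v aligned with the dominant eigenspace, ||S v|| = |lambda_max|.
    return math.sqrt(sum(x * x for x in matvec(s, v)))


def covariance_error(a, b):
    """Assumed definition: ||A^T A - B^T B||_2 / ||A||_F^2."""
    ga, gb = gram(a), gram(b)
    diff = [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(ga, gb)]
    frob_sq = sum(x * x for row in a for x in row)
    return spectral_norm_symmetric(diff) / frob_sq


# Window matrix A (2 rows) and a one-row sketch B that drops the second direction.
A = [[3.0, 0.0], [0.0, 4.0]]
B = [[3.0, 0.0]]
err = covariance_error(A, B)
```

Power iteration on S·S is used because AᵀA − BᵀB can have negative eigenvalues; squaring makes the iteration converge on the eigenvalue of largest magnitude regardless of sign.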
Sampling based algorithms •
Maintaining top-1 priority over sliding window [Figure: priority vs. time] Probability in the skyline:
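One common way to keep the highest-priority item of a sequence-based window is a monotonic queue that stores only the "skyline": items not dominated by a newer item with higher priority. The sketch below illustrates that idea under stated assumptions. The class name, the `push`/`top` interface, and the default uniform random priority are hypothetical; the slide does not specify how priorities are assigned.

```python
import random
from collections import deque


class SlidingWindowTopPriority:
    """Track the highest-priority item among the last `window` arrivals."""

    def __init__(self, window, rng=None):
        self.window = window
        self.rng = rng or random.Random()
        self.t = 0               # number of items seen so far
        self.skyline = deque()   # (arrival index, priority, item), priorities strictly decreasing

    def push(self, item, priority=None):
        # Assumption: a uniform random priority when the caller supplies none.
        if priority is None:
            priority = self.rng.random()
        self.t += 1
        # An older item with priority no higher than the newcomer can never be
        # the maximum again, because it expires before the newcomer does.
        while self.skyline and self.skyline[-1][1] <= priority:
            self.skyline.pop()
        self.skyline.append((self.t, priority, item))
        # Expire items that have slid out of the window.
        while self.skyline[0][0] <= self.t - self.window:
            self.skyline.popleft()

    def top(self):
        """Return the item with the highest priority in the current window."""
        return self.skyline[0][2]


# Usage with explicit priorities so the behaviour is deterministic.
sampler = SlidingWindowTopPriority(window=3)
tops = []
for item, prio in [("a", 0.9), ("b", 0.5), ("c", 0.7), ("d", 0.1), ("e", 0.2)]:
    sampler.push(item, prio)
    tops.append(sampler.top())
```

Only skyline items are stored, so memory depends on how many items survive domination rather than on the window length.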
Logarithmic Method: LM-FD algorithm •
Dyadic Interval: DI-FD algorithm
Low rank approximation • Principal component analysis (PCA) • k-means clustering • Latent Semantic Indexing (LSI)
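As a hedged reminder of what "low rank approximation" usually means here, the Eckart–Young theorem gives the best rank-$k$ approximation in Frobenius norm by truncating the SVD. The symbols $\sigma_i$, $u_i$, $v_i$ (singular values in decreasing order and the corresponding left and right singular vectors of $A$) are assumed notation, not taken from the slide.

```latex
A_k = \sum_{i=1}^{k} \sigma_i \, u_i v_i^{\top}
    = \operatorname*{arg\,min}_{\operatorname{rank}(X) \le k} \| A - X \|_F ,
\qquad
\| A - A_k \|_F^{2} = \sum_{i > k} \sigma_i^{2}
```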