Time Series Compressibility and Privacy. Spiros Papadimitriou*, Feifei Li+, George Kollios+, Philip S. Yu* (*IBM TJ Watson, +Boston University)
Intuition / Motivation: Introduce uncertainty about individual values, while still allowing interesting pattern mining. We need to publish some value within the band around each true value: which one? [Figure: speed vs. time, with bands around 55 mph (highway) and 35 mph (city).]
Random (white noise)? Completely random perturbation? Cars (typically) don’t drive like this, so the noise can be filtered out. [Figure: speed vs. time.]
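The filtering argument can be illustrated with a minimal sketch. The speed trace, the noise level `sigma`, and the choice of a moving-average filter are all assumptions made for this example; the slides do not specify them here.

```python
import math
import random


def rms(values):
    """Root-mean-square of a sequence."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def moving_average(series, width):
    """Centered moving average; the window is truncated at the edges."""
    half = width // 2
    out = []
    for i in range(len(series)):
        window = series[max(0, i - half): i + half + 1]
        out.append(sum(window) / len(window))
    return out


rng = random.Random(0)
n = 512
# Hypothetical speed trace: 55 mph (highway) for the first half, then 35 mph (city).
true_speed = [55.0 if t < n // 2 else 35.0 for t in range(n)]
sigma = 5.0  # assumed white-noise level, in mph
published = [x + rng.gauss(0.0, sigma) for x in true_speed]
filtered = moving_average(published, width=15)

# Error of the published series vs. error after the attacker's filter.
error_before = rms([p - x for p, x in zip(published, true_speed)])
error_after = rms([f - x for f, x in zip(filtered, true_speed)])
```

Because the true trace is smooth and the noise is not, even this simple linear filter removes most of the added uncertainty.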
Deterministic? Completely “deterministic” perturbation? Vulnerable to true value leaks. [Figure: speed vs. time.]
First extreme case: white noise (completely random).
Summary of extreme cases: completely random vs. completely “deterministic”. Can we adaptively combine completely random and completely “deterministic”?
Main challenge: the attacker may have knowledge of an arbitrary number of true values, or knowledge of the signal’s subspace (“shape”) with arbitrary precision. Approaches compared against both: completely random, completely “deterministic”, and combining both.
Goals: Partial “information hiding” via data perturbation, for time series. The perturbation adapts to data properties: it automatically combines “random” and “deterministic” at appropriate scales. Evaluate against both filtering and true value leaks. Suitable for on-the-fly, streaming perturbation.
Overview: Definitions, Method, Experiments, Conclusion
Utility = discord. Published values are (in expectation) within the discord of the true values. [Figure: series vs. time.]
Privacy = final uncertainty. Recovered values are (in expectation) within the final uncertainty of the true values. [Figure: series vs. time.]
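The two definitions above can be made concrete with a small sketch. It assumes both quantities are measured as the RMS difference from the true series (the experiments report discord in % RMS); the exact formulas on the original slides were lost, and the function name and sample values below are hypothetical.

```python
import math


def rms_difference(a, b):
    """RMS of the element-wise difference between two equal-length series."""
    if len(a) != len(b):
        raise ValueError("series must have equal length")
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)) / len(a))


true_values = [10.0, 12.0, 11.0, 13.0]
published = [11.0, 11.0, 12.0, 12.0]   # perturbed series that is released
recovered = [10.5, 11.5, 11.5, 12.5]   # an attacker's estimate of the true series

discord = rms_difference(published, true_values)       # utility: 1.0 here
uncertainty = rms_difference(recovered, true_values)   # privacy: 0.5 here
```

In this toy case the attacker removes half of the added uncertainty, so the privacy actually delivered is lower than the discord paid for it.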
Goal: Recovery of true values is based on assumptions about the attack model, with specific background knowledge: linear filtering, and linear reconstruction (based on true values).
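The second attack model, linear reconstruction from known true values, might look like the sketch below. The affine “deterministic” perturbation, the leaked indices, and the helper `fit_line` are assumptions chosen to make the leak obvious; they are not taken from the slides.

```python
def fit_line(ys, xs):
    """Least-squares fit of x ~ slope * y + intercept."""
    n = len(ys)
    mean_y = sum(ys) / n
    mean_x = sum(xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys)
    cov = sum((y - mean_y) * (x - mean_x) for y, x in zip(ys, xs))
    slope = cov / var_y
    return slope, mean_x - slope * mean_y


true_values = [55.0, 54.0, 56.0, 40.0, 35.0, 36.0, 34.0, 35.0]
# Hypothetical "deterministic" perturbation: a fixed affine distortion of the signal.
published = [1.1 * x - 2.0 for x in true_values]

leaked = [0, 4, 6]  # indices whose true values the attacker is assumed to know
slope, intercept = fit_line([published[i] for i in leaked],
                            [true_values[i] for i in leaked])

# The regression learned from three leaked points recovers every other value.
recovered = [slope * y + intercept for y in published]
max_error = max(abs(r - x) for r, x in zip(recovered, true_values))
```

A handful of leaked values is enough to undo the whole perturbation here, which is the weakness of the completely “deterministic” extreme.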
Overview: Definitions, Method, Experiments, Conclusion
Wavelet and Fourier representations (one-slide refresher). [Figure: Fourier representation, frequency vs. time; wavelet representation, scale (frequency) vs. time.]
Our work: Fourier-based perturbation (batch); wavelet-based perturbation (batch and streaming).
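Fourier-based perturbation (intuition): perturbed series = original series + perturbation, shown in both the time domain and the frequency domain. Energy is concentrated in a few coefficients: high compression. [Figure: original series, perturbation, and perturbed series in the time and frequency domains.]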
Fourier-based perturbation (intuition & summary). [Figure: frequency vs. time.]
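One way to read the Fourier intuition is sketched below: noise is placed only at the frequencies that carry most of the signal's energy, in proportion to that energy, so that the perturbation has the same "shape" as the data. Selecting the `keep` strongest frequencies, the random phases, and the direct O(N^2) DFT are simplifying assumptions for this example, not the method from the paper.

```python
import cmath
import math
import random


def dft_magnitudes(series):
    """Magnitudes |X_k| for k = 1 .. N/2 - 1, via a direct O(N^2) DFT."""
    n = len(series)
    mags = {}
    for k in range(1, n // 2):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(series))
        mags[k] = abs(coeff)
    return mags


def fourier_perturb(series, sigma, keep, rng):
    """Add sinusoidal noise at the `keep` strongest frequencies; total RMS is sigma."""
    n = len(series)
    mags = dft_magnitudes(series)
    top = sorted(mags, key=mags.get, reverse=True)[:keep]
    energy = sum(mags[k] ** 2 for k in top)
    noise = [0.0] * n
    for k in top:
        # Amplitude proportional to |X_k|; a unit cosine has mean square 1/2.
        amp = sigma * math.sqrt(2.0 * mags[k] ** 2 / energy)
        phase = rng.uniform(0.0, 2.0 * math.pi)
        for t in range(n):
            noise[t] += amp * math.cos(2.0 * math.pi * k * t / n + phase)
    return [x + e for x, e in zip(series, noise)]


rng = random.Random(1)
n = 128
series = [math.sin(2 * math.pi * 3 * t / n) + 0.5 * math.sin(2 * math.pi * 10 * t / n)
          for t in range(n)]
published = fourier_perturb(series, sigma=0.2, keep=2, rng=rng)
discord = math.sqrt(sum((p - x) ** 2 for p, x in zip(published, series)) / n)
```

Since the noise occupies the same frequencies as the signal, a filter that removes it would also remove the signal itself.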
Wavelet-based perturbation (intuition & summary). [Figure: scale (frequency) vs. time.] Next: how to do this online? (1) Wavelet transform; (2) Noise allocation.
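A batch wavelet-based perturbation could be sketched as follows. The Haar wavelet, the fixed `threshold`, and the rule of drawing noise with standard deviation proportional to each large coefficient are assumptions for illustration; the paper's actual allocation may differ.

```python
import math
import random


def haar_forward(series):
    """Orthonormal Haar transform; len(series) must be a power of two."""
    data, details = list(series), []
    while len(data) > 1:
        pairs = [(data[i], data[i + 1]) for i in range(0, len(data), 2)]
        details.append([(a - b) / math.sqrt(2) for a, b in pairs])  # finest first
        data = [(a + b) / math.sqrt(2) for a, b in pairs]
    return data[0], details


def haar_inverse(approx, details):
    """Inverse of haar_forward."""
    data = [approx]
    for level in reversed(details):  # coarsest level first
        data = [v for a, d in zip(data, level)
                for v in ((a + d) / math.sqrt(2), (a - d) / math.sqrt(2))]
    return data


def wavelet_perturb(series, sigma, threshold, rng):
    """Perturb only detail coefficients above `threshold`; total RMS is sigma."""
    n = len(series)
    approx, details = haar_forward(series)
    noise = [[rng.gauss(0.0, abs(c)) if abs(c) > threshold else 0.0 for c in level]
             for level in details]
    total = sum(e * e for level in noise for e in level)
    # Rescale so the noise energy is n * sigma^2 (orthonormal transform).
    scale = math.sqrt(n * sigma ** 2 / total) if total > 0 else 0.0
    noisy = [[c + scale * e for c, e in zip(level, noise_level)]
             for level, noise_level in zip(details, noise)]
    return haar_inverse(approx, noisy)


rng = random.Random(2)
n = 64
# Hypothetical trace with a single drop from 55 mph to 35 mph at t = 20.
series = [55.0 if t < 20 else 35.0 for t in range(n)]
published = wavelet_perturb(series, sigma=0.5, threshold=1.0, rng=rng)
discord = math.sqrt(sum((p - x) ** 2 for p, x in zip(published, series)) / n)
```

Because the transform is orthonormal, the noise energy added to the coefficients equals the noise energy in the time domain, which is why the discord comes out at `sigma`.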
Streaming perturbation (1) Wavelet transform: summary. Forward transform: post-order traversal of the coefficient tree; O(lg N) space; O(1) time (amortized). [Figure: coefficient tree with nodes numbered in traversal order.]
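The space and time bounds for the forward transform can be seen in a minimal streaming Haar sketch. The class name `StreamingHaar` and its interface are hypothetical; the idea is only that one pending scaling coefficient per level is enough, and that a detail coefficient is emitted as soon as both of its children are complete (post-order).

```python
import math


class StreamingHaar:
    """Incremental orthonormal Haar transform with one pending value per level."""

    def __init__(self):
        # pending[j] holds the scaling coefficient at level j that is still
        # waiting for its right-hand sibling, or None if there is none.
        self.pending = []

    def push(self, x):
        """Consume one sample; return the (level, detail) pairs it completes."""
        emitted = []
        value, level = x, 0
        while level < len(self.pending) and self.pending[level] is not None:
            left = self.pending[level]
            self.pending[level] = None
            emitted.append((level, (left - value) / math.sqrt(2)))
            value = (left + value) / math.sqrt(2)
            level += 1
        if level == len(self.pending):
            self.pending.append(None)
        self.pending[level] = value
        return emitted


transform = StreamingHaar()
# Number of detail coefficients completed by each of eight incoming samples.
counts = [len(transform.push(float(x))) for x in range(1, 9)]
```

The update behaves like incrementing a binary counter: most samples complete zero or one coefficient, and the occasional long cascade is paid for by the cheap steps before it, which gives the amortized O(1) cost.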
Streaming perturbation (1) Wavelet transform: summary. Inverse transform: pre-order traversal of the coefficient tree; O(lg N) space; O(1) time (amortized). [Figure: coefficient tree with nodes numbered in traversal order.]
Streaming perturbation (2) Noise allocation: summary. Challenge: knowing only the wavelet coefficients up to the current time, how can we allocate the noise online so that it is as close as possible to the batch allocation? Indefinite publication delay?
Streaming perturbation (2) Noise allocation: summary. Per-band lookahead [see paper for details]. [Figure: comparison with the batch allocation; legend: exceeds threshold, perturbed.]
Overview: Definitions, Method, Experiments, Conclusion
Experimental overview: Datasets. Chlorine: chlorine concentration in a drinking-water distribution network. Light: light intensity measurements (Intel Berkeley). SP500: Standard & Poor's 500 index. [Figure: plots of the Chlorine, Light, and SP500 series.]
Experimental overview: Varying discord levels and perturbation methods: IID, Fourier-based (FFT), batch wavelet-based (DWT), and streaming wavelet-based (str. DWT). Filter: wavelet shrinkage [Donoho / TOIT 95]. True values: linear regression.
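The filtering attack used in the experiments, wavelet shrinkage, can be approximated by the sketch below. It assumes a Haar wavelet, soft thresholding with the universal threshold `sigma * sqrt(2 ln N)`, and an attacker who knows the noise level `sigma`; the experimental setup in the paper may use a different wavelet or threshold estimate.

```python
import math
import random


def haar_forward(series):
    """Orthonormal Haar transform; len(series) must be a power of two."""
    data, details = list(series), []
    while len(data) > 1:
        pairs = [(data[i], data[i + 1]) for i in range(0, len(data), 2)]
        details.append([(a - b) / math.sqrt(2) for a, b in pairs])  # finest first
        data = [(a + b) / math.sqrt(2) for a, b in pairs]
    return data[0], details


def haar_inverse(approx, details):
    """Inverse of haar_forward."""
    data = [approx]
    for level in reversed(details):  # coarsest level first
        data = [v for a, d in zip(data, level)
                for v in ((a + d) / math.sqrt(2), (a - d) / math.sqrt(2))]
    return data


def soft(c, lam):
    """Soft thresholding: shrink c toward zero by lam."""
    return math.copysign(max(abs(c) - lam, 0.0), c)


def shrinkage_filter(published, sigma):
    """Denoise by soft-thresholding the detail coefficients."""
    lam = sigma * math.sqrt(2.0 * math.log(len(published)))
    approx, details = haar_forward(published)
    return haar_inverse(approx, [[soft(c, lam) for c in level] for level in details])


def rms_diff(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)) / len(a))


rng = random.Random(0)
n, sigma = 256, 1.0
# Hypothetical trace: 55 mph for the first half, 35 mph for the second.
true_values = [55.0 if t < n // 2 else 35.0 for t in range(n)]
published = [x + rng.gauss(0.0, sigma) for x in true_values]  # IID perturbation
recovered = shrinkage_filter(published, sigma)

discord = rms_diff(published, true_values)
remaining = rms_diff(recovered, true_values)
removed_pct = 100.0 * (1.0 - remaining / discord)  # "removed noise (%)"
```

On a compressible signal like this one, IID noise spreads over all coefficients while the signal sits in a few, so thresholding strips away most of the perturbation.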
Removed uncertainty. [Figure: removed noise (%) by perturbation method and discord (% RMS).]
Removed uncertainty: Average (over ten runs). IID noise: excellent resilience to leaks, very poor resilience to filtering. Other methods: comparable.
Removed uncertainty (Light): Maximum (over ten runs). Fourier may perform poorly for “non-smooth” signals.
“True” uncertainty. [Figure: remaining noise (% RMS) vs. discord (% RMS).]
“True” uncertainty: Average (over ten runs). IID noise: very poor overall. Other methods: comparable.
“True” uncertainty: Maximum (over ten runs). Fourier may perform poorly for “non-smooth” signals.
Scalability: constant time per measurement.
Overview: Definitions, Method, Experiments, Conclusion
Related work (1/2)
Privacy-preserving data mining:
  SMC: [Lindell & Pinkas / CRYPTO 00], [Vaidya & Clifton / KDD 02]
  Partial information hiding:
    Perturbation: [Agrawal & Srikant / SIGMOD 00], [Du & Zhan / KDD 03], [Kargupta, Datta, Wang & Sivakumar / ICDM 03], [Agrawal & Aggarwal / EDBT 04], [Chen & Liu / ICDM 05], [Huang, Du & Chen / SIGMOD 05], [Liu, Ryan & Kargupta / TKDE 05], [Li et al. / ICDE 07]
    k-anonymity: [Sweeney / IJUFKS 02], [Aggarwal & Yu / EDBT 04], [Bertino, Ooi, Yang & Deng / ICDE 05], [Kifer & Gehrke / SIGMOD 06], [Machanavajjhala, Gehrke & Kifer / ICDE 06], [Xiao & Tao / SIGMOD 06]
  Interactive privacy: [Blum, Dwork, McSherry & Nissim / PODS 05], [Dwork, McSherry, Nissim & Smith / TCC 06]
    SSDBs: [Denning / TODS 80]
Wavelets in DM: [Gilbert, Kotidis, Muthukrishnan & Strauss / VLDB 01], [Garofalakis & Gibbons / SIGMOD 02], [Bulut & Singh / ICDE 03], [Papadimitriou, Brockwell & Faloutsos / VLDB 04], [Lin, Vlachos, Keogh & Gunopulos / EDBT 04], [Karras & Mamoulis / VLDB 05]
Compression and DM: [Keogh, Lonardi & Ratanamahatana / KDD 04]
Related work (2/2)
Correlated perturbation: [Kargupta, Datta, Wang & Sivakumar / ICDM 03], [Huang, Du & Chen / SIGMOD 05], for streams [Li et al. / ICDE 07]
L-diversity [Machanavajjhala, Gehrke & Kifer / ICDE 06] and personalized privacy [Xiao & Tao / SIGMOD 06]
Dimensionality curse and privacy: [Aggarwal / VLDB 05]
Watermarking: [Sion, Atallah & Prabhakar / TKDE 06]
Compressed sensing: [Donoho / TOIT 06], [Candès, Romberg & Tao / TOIT 06]
Conclusion: Partial information hiding via data perturbation, with user-defined discord (utility). The perturbation adapts to data properties: it automatically combines “random” and “deterministic” at appropriate scales, and additionally preserves spectral properties. Evaluated against both filtering and true value leaks. Suitable for on-the-fly, streaming perturbation. Perturbing data objects with any “structure” is non-trivial, even under fixed attack model(s).