Efficiently Monitoring Topk Pairs over Sliding Windows Computer

  • Slides: 17
Download presentation
Efficiently Monitoring Top-k Pairs over Sliding Windows Computer Science and Engineering Presented By: Zhitao

Efficiently Monitoring Top-k Pairs over Sliding Windows Computer Science and Engineering Presented By: Zhitao Shen 1 Joint work with Muhammad Aamir Cheema 1, Xuemin Lin 21, Wenjie Zhang 1, Haixun Wang 3 1 The University of New South Wales, Australia 2 East China Normal University 3 Microsoft Research Asia

Introduction Top-k Pairs Query: • Given a scoring function score() that computes the score

Introduction Top-k Pairs Query: • Given a scoring function score() that computes the score of a pair of objects, return k pairs of objects with the smallest scores. Examples: • k closest pairs queries • k furthest pairs queries Top-k Pairs against sliding windows • Given a data stream, return top-k pairs among the most recent N objects. Applications • Wireless sensor network, stock market, traffic monitoring and transaction monitoring 2

Motivation No existing work for general pairs queries over sliding windows Select a. id,

Motivation No existing work for general pairs queries over sliding windows Select a. id, b. id from trans a, trans b Support arbitrary scoring functions. where a. id <> b. id and a. account = b. account order by |a. time - b. time| - dist(a. loc, b. loc) limit k Example: window [24 hours] Fraud detection over transaction streams – Query the transaction pairs that have small time difference but the locations are far away. 203 -13845 10: 15: 20 New York $1000 203 -13845 3 10: 18: 10 L. A. $1000

Problem Definitions (Preliminaries) Sliding Windows – A sliding window contains most recent N objects

Problem Definitions (Preliminaries) Sliding Windows – A sliding window contains most recent N objects of the data stream. – The number of pairs is N(N – 1) / 2 The age of a pair depends on the older object. older . . . o 7 o 6 Age of an object: 5 o 4 o 3 o 2 o 10 4 Sliding 3 window 2 1 of size 05 Lower bound runtime cost : O(N) for each new object Lower bound storage cost : O(N) 4 newer

Contributions Unified framework • First to study top-k pairs queries over sliding windows. •

Contributions Unified framework • First to study top-k pairs queries over sliding windows. • Support arbitrarily complex scoring functions • Support efficient queries for any window size n ≤ N and any k ≤ K Lower bound Expected cost for our algorithms 5 Storage requirement O(N) + O(K log(N/K)) for each scoring function Skyband maintenance cost for each object O(N) O(N (log N) + log K)) Answering top-k pairs O(k) O(log n) + log K + k)

Preliminaries o 4 Task 1 : how we efficiently maintain the K -skyband Task

Preliminaries o 4 Task 1 : how we efficiently maintain the K -skyband Task 2 : how we use the Kskyband to efficiently obtain top-k pairs against any sliding window n ≤ N Score Map all the pairs to an age–score space Top-2 pairs p 6 p 1 p 9 p 8 p 7 p 4 2 3 o 1 p 2 dominates p 5 because p 2. score < p 5. score and p 2 expires no later than p 5. Naive: O(N |SKB|) for checking all N-1 pairs Our: O(N log|SKB|) 4 Age K-skyband[Papadias et al. , TODS 05] keeps the minimum set for the candidate results. 6 o 0 Expected size of skyband is O(K log(N/K)) p 5 p 2 1 o 2 p 1(o 0, o 1) (p 1. age, p 1. score) (1, 3) p 10 p 3 o 3

Efficient Skyband Maintenance p 5 s 1 How can we efficiently compute the K-staircase

Efficient Skyband Maintenance p 5 s 1 How can we efficiently compute the K-staircase and K-skyband? Update the K-staircase and K-skyband in O(|SKB| log K)), Score Can we find a boundary between the skyband points and non-skyband points? p 1 p 2 K-staircase p 7 s 2 p 6 Check if a pair is dominated by K-skyband in O(log |SKB|) time for each new pair by doing binary search. p 3 s 1 p 5 p 4 2 -skyband 7 s 2 Age p 1

Efficient Query Answering Use Priority Search Tree to index the skyband points Score Can

Efficient Query Answering Use Priority Search Tree to index the skyband points Score Can we do better for any sliding window size n < N? Any. Window windowsize==Nn < N p 2 p 1 p 3 p 5 p 1 p 4 p 7 p 6 p 5 p 8 p 7 2 -skyband 8 p 2 3 p 6 1 2 6 p 3 4 p 4 9 8 5 Priority Search Tree Age Self-balancing tree Efficient 3 -sides range query

Efficient Query Answering An algorithm similar to post-order traversal costs O(log|SKB| + k) Score

Efficient Query Answering An algorithm similar to post-order traversal costs O(log|SKB| + k) Score Our contribution: Retrieve top-k pairs in the 1 -sided range. Any window size = n < N p 2 p 1 p 3 p 5 p 1 p 4 p 7 p 6 p 5 p 8 p 7 2 -skyband 9 p 2 3 p 6 1 2 6 p 3 4 p 4 5 Priority Search Tree Age 8 9

What else in the paper? Efficient continuous queries on the skyband. • Continuously monitoring

What else in the paper? Efficient continuous queries on the skyband. • Continuously monitoring the top-k results for any fixed k (k ≤ K) and n (n ≤ N). • Amortized O(k/n (log |SKB| + k)) time per update. Optimization on monotonic scoring functions. • Handling the k-closest pairs, k-furthest pairs queries. • Applying Threshold Algorithm on sorted lists • Improving the number of considered pairs for each new object from N to (d+1) N d/(d+1) K 1/(d+1) 10

Experimental Settings Real dataset. – Sensor data in the Intel research lab – 2.

Experimental Settings Real dataset. – Sensor data in the Intel research lab – 2. 3 million records. Synthetic data. – Uniform, correlated anti-correlated distributions. – 2 million objects – Closest and furthest pairs in Manhattan distance 11

Experiments (Overall Cost on real data) SCase: our algorithm using K-staircase to maintain the

Experiments (Overall Cost on real data) SCase: our algorithm using K-staircase to maintain the skyband. Naïve: maintains k. N pairs and sort them on their scores. LB: shows lower bound cost Varying K 12 Varying N (in thousands)

Experiments (Query Answering) Linear: scan the skyband points to find the top-k pairs. Snapshot:

Experiments (Query Answering) Linear: scan the skyband points to find the top-k pairs. Snapshot: our snapshot query algorithm. Continuous: our continuous query algorithm. LB: an algorithm to obtain top-k results in O(k) time. Varying K 13 Varying |Q| (in thousands)

Conclusion: • First to study a broad class of top-k pairs queries over sliding

Conclusion: • First to study a broad class of top-k pairs queries over sliding windows. • We present efficient algorithms and show that the performance of our algorithm is reasonably close to the lower bound cost. • We provide extensive experiment results on both real and synthetic data sets to show the efficiency and scalability of the proposed algorithms. 14

Question and Answer Thank You! Any Questions? 15

Question and Answer Thank You! Any Questions? 15

Related Work Top-k Query Processing • Fagin’s Algorithm (FA), threshold Algorithm (TA), no-random access

Related Work Top-k Query Processing • Fagin’s Algorithm (FA), threshold Algorithm (TA), no-random access (NRA) Top-k Pairs Queries Processing • k-closest pairs queries • k-furthest pairs queries • Top-k pairs queries [Cheema et al. , ICDE’ 11] Data Stream Processing • Top-k query processing over data stream [Mouratidis et al. , SIGMOD’ 06] • k-nearest neighbour queries [Böhm et al. , ICDE’ 07] 16

Experiments (Skyband Maintenance algorithm) Basic: maintening algorithm without K-staircase SCase: our algorithm using K-staircase

Experiments (Skyband Maintenance algorithm) Basic: maintening algorithm without K-staircase SCase: our algorithm using K-staircase to maintain the skyband. TA: Optimized algorithm for monotonic scoring functions. LB: show lower bound cost Varying K 17 # of attributes