A Sampling-Based Approach to Optimizing Top-k Queries in Sensor Networks
Adam Silberstein, Rebecca Braynard, Carla Ellis, Kamesh Munagala, Jun Yang
Duke University

Querying Sensor Networks • Sensors provide a huge amount of data • Nodes are energy-constrained – Radio is used to transmit data – Radio is the primary energy consumer ➢ Challenge: use data to answer queries while limiting the size and number of messages • Allowing approximation has great potential

Outline 1. Top-k Query Challenges 2. Sampling-Based Optimization 3. Prospector Algorithms 4. Experimental Evaluation

Top-k Query • Top-k Query Definition – Return the nodes with the k highest sensor readings – Approximate version: Given an energy budget, gather node values such that the number of top-k values gathered is maximized • Example – Get a list of the most popular bird feeding sites so ornithologists can visit those areas

Top-k Challenges ➢ Key Point: Unlike selection queries, whether a node is in the solution depends on the values of other nodes • Cannot simply rank nodes by model means • Calculating the likelihood that a node is top-k requires joint comparison of all node models – Not easy even when models are simple – Exponential number of cases in which each node is in top-k

Top-k Challenges • Assume we know the probability that a node is in top-k – Still other issues influence an acquisition plan – Local filtering • Greedily choosing the highest-probability nodes is not necessarily the best strategy • If no more than 2 top-k nodes ever come from the subtree rooted at u, don’t assign bandwidth greater than 2 out of that subtree – Constructing Proof • An additional goal may be to prove as many top-k values as possible • Must return non-solution values as proof

Terminology • Network of n nodes, u_1, …, u_n • Nodes arranged in a routing spanning tree, T – Parent of u_i: parent(u_i) – Child nodes of u_i: children(u_i) • Cost model – Overhead cost per edge: σ – Marginal data cost per edge: δ ➢ A top-k plan is an assignment of bandwidths to edges dictating data collection ➢ Total bandwidth cost must be less than the user-specified energy budget
[Figure: two example plans on trees rooted at the base station (BS); costs shown: 3σ + 4δ and 4σ + 4δ]
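The cost model on this slide can be sketched as a small function. This is a minimal sketch, assuming that every edge carrying at least one value pays the overhead σ once plus δ per value sent, and that a plan fits when its cost is at most the budget. The `plan_cost` and `within_budget` helpers and the edge names are hypothetical. The values 40 and 18 are the fixed and marginal costs listed under Experimental Details.

```python
def plan_cost(bandwidth, sigma, delta):
    """Total cost of a plan given as {edge: number of values sent on it}.

    Assumption: an edge with bandwidth 0 is unused and costs nothing; a
    used edge pays the overhead sigma once plus delta per value.
    """
    used = [b for b in bandwidth.values() if b > 0]
    return sigma * len(used) + delta * sum(used)


def within_budget(bandwidth, sigma, delta, budget):
    """Check a plan against the energy budget (at most the budget: assumption)."""
    return plan_cost(bandwidth, sigma, delta) <= budget


# Hypothetical plan: three used edges carrying four values in total,
# so the cost has the form 3*sigma + 4*delta.
plan = {("A", "BS"): 2, ("B", "A"): 1, ("C", "A"): 1, ("D", "BS"): 0}
print(plan_cost(plan, sigma=40, delta=18))  # 3*40 + 4*18 = 192
```

Keeping the overhead and marginal terms separate makes it visible that adding a value to an already-used edge costs only δ, while opening a new edge costs σ + δ.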

Sampling-Based Optimization • Base optimization on a set of samples drawn from the joint distribution of nodes • Mark which nodes are top-k in each sample • Acquire as many "1"s as possible
[Figure: routing tree with base station (BS) and nodes 1 to 5]
Sample values:
node   sample 1   sample 2   sample 3
1      3          22         4
2      14         19         26
3      5          7          9
4      16         12         10
5      10         17         20
Top-k marks:
node   sample 1   sample 2   sample 3
1      0          1          0
2      1          1          1
3      0          0          0
4      1          0          0
5      0          0          1
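The marking step can be sketched in a few lines. The sample values below are read from this slide's sample table. The marks they produce match the slide when k = 2, which is an inference from the data rather than something the slide states. The `mark_topk` helper name and the tie-breaking rule are assumptions.

```python
def mark_topk(samples, k):
    """samples: {node: [value in sample 1, value in sample 2, ...]}.

    Returns {node: [0 or 1 per sample]}, with 1 where the node holds one
    of the k highest values of that sample. Ties keep node order (assumption).
    """
    nodes = sorted(samples)
    n_samples = len(samples[nodes[0]])
    marks = {node: [0] * n_samples for node in nodes}
    for s in range(n_samples):
        ranked = sorted(nodes, key=lambda node: samples[node][s], reverse=True)
        for node in ranked[:k]:
            marks[node][s] = 1
    return marks


samples = {
    1: [3, 22, 4],
    2: [14, 19, 26],
    3: [5, 7, 9],
    4: [16, 12, 10],
    5: [10, 17, 20],
}
marks = mark_topk(samples, k=2)
for node in sorted(marks):
    print(node, marks[node])
```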

Theoretical Basis for Sampling • Optimize over the samples • Polynomial number of samples sufficient for prediction • Approach proven for a class of 2-stage stochastic optimization problems [Shmoys and Swamy FOCS 04] • Proof for approximate top-k is a reduction from stochastic Steiner tree

Sampling Advantages • Work directly with sample values – Avoid having to learn and fit models to data • Any correlation is implied within the samples – If a model is available, can draw values from it • Sample maintenance is then no more expensive than explicit model maintenance – In top-k, avoid calculating the probabilities of nodes being in the result

Prospectors • Prospecting: Dig for answers in the most promising places, given an energy budget • Prospector Greedy – Sum sample columns – Visit most frequent contributors
node   sample 1   sample 2   sample 3   sum
1      0          1          0          1
2      1          1          1          3
3      0          0          0          0
4      1          0          0          1
5      0          0          1          1
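The greedy step on this slide amounts to counting and sorting. The sketch below assumes the budget is expressed simply as a maximum number of nodes to visit, which is a simplification introduced here; the `greedy_order` and `greedy_pick` names and the tie-breaking by node id are also hypothetical.

```python
def greedy_order(marks):
    """Rank nodes by the number of samples in which they are a top-k node."""
    counts = {node: sum(row) for node, row in marks.items()}
    # Highest count first; ties broken by node id (assumption).
    order = sorted(counts, key=lambda node: (-counts[node], node))
    return order, counts


def greedy_pick(marks, max_nodes):
    """Visit the most frequent contributors, up to max_nodes of them."""
    order, _ = greedy_order(marks)
    return order[:max_nodes]


# Top-k marks from the slide's sample table (one row per node).
marks = {1: [0, 1, 0], 2: [1, 1, 1], 3: [0, 0, 0], 4: [1, 0, 0], 5: [0, 0, 1]}
order, counts = greedy_order(marks)
print(counts)
print(greedy_pick(marks, 2))
```

Note that this ranking ignores where nodes sit in the routing tree, so it cannot account for shared edges.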

Prospectors • Prospector LP-LF – Consider the bandwidth cost model • Encode with linear programming – Each visited node requires every one of its ancestor edges to be used, at an additional marginal cost of a single (node, value) pair per edge ➢ Now we favor visiting nodes clustered in the same area of the network!
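The reason clustering is favored can be seen by computing the cost of a set of visited nodes when every value travels all the way to the base station. The sketch below is not the linear program itself, only the cost it has to account for. The tree, the node names, and the `cost_without_filtering` helper are hypothetical, and the costs 40 and 18 are the fixed and marginal costs listed under Experimental Details.

```python
def cost_without_filtering(parent, visited, sigma, delta):
    """Cost of visiting `visited` when each value is relayed up to the root.

    parent: {node: parent node}; the base station has no entry.
    Each edge is identified by its child endpoint.
    """
    bandwidth = {}
    for node in visited:
        current = node
        while current in parent:  # walk up until the base station is reached
            bandwidth[current] = bandwidth.get(current, 0) + 1
            current = parent[current]
    return sigma * len(bandwidth) + delta * sum(bandwidth.values())


# Hypothetical tree: BS has children A and D; A has children B and C; D has child E.
parent = {"A": "BS", "B": "A", "C": "A", "D": "BS", "E": "D"}
clustered = cost_without_filtering(parent, {"B", "C"}, sigma=40, delta=18)
scattered = cost_without_filtering(parent, {"B", "E"}, sigma=40, delta=18)
print(clustered, scattered)
```

Both sets send four values over edges in total, but the clustered pair shares the edge from A to the base station and so pays one fewer overhead charge.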

Prospectors • Prospector LP+LF – Add in local filtering – Each visited node doesn’t necessarily contribute a (node, value) pair to each ancestor edge message • i.e., b(u_i) ≠ Σ b(child(u_i)) + {0, 1}, where b is bandwidth – More flexibility • Acquire all of the subtree’s top-k values, but with lower outgoing bandwidth on ancestor edges • If a subtree never contributes more than x values, even if the particular nodes vary, assign no more than x bandwidth
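Local filtering can be illustrated by simulating how values move up the tree when each node forwards only as many values as its outgoing bandwidth allows. This is a minimal sketch under the assumption that every node merges its own reading with what its children send and keeps the highest values. The tree, readings, bandwidths, and the `collect` helper are hypothetical.

```python
def collect(node, children, readings, bandwidth):
    """Return the (value, node) pairs that `node` sends to its parent.

    The node pools its own reading with its children's messages and
    forwards only the bandwidth[node] highest pairs.
    """
    pool = [(readings[node], node)]
    for child in children.get(node, []):
        pool.extend(collect(child, children, readings, bandwidth))
    pool.sort(reverse=True)
    return pool[:bandwidth.get(node, 0)]


# Hypothetical subtree rooted at A with three leaf children.
children = {"A": ["B", "C", "D"]}
readings = {"A": 5, "B": 20, "C": 7, "D": 18}
bandwidth = {"A": 2, "B": 1, "C": 1, "D": 1}
sent = collect("A", children, readings, bandwidth)
print(sent)
```

Four nodes are visited here, yet the edge out of A carries only two values, which is the saving that the inequality on the slide allows.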

Prospectors
[Figure: side-by-side example plans for Greedy, LP-LF, and LP+LF]

Prospectors • Previous Prospectors cannot guarantee anything about their result • Prospector Proof – Pass up additional values to prove that other values are the top-l in a subtree • Sort all values passed to the subtree root • v_i is top-l for the subtree if: 1. It is within the top-l of the sorted list 2. A proven value from each child subtree besides v_i’s is in the list below the top l, or all values from that subtree are in the list ➢ Augment with a 2nd querying stage to get the remaining top k • Prospector Exact

Experimental Evaluation • Comparison of Algorithms • Results of Energy Cost vs. Accuracy – Oracle: Knows top-k nodes in advance – Naïve: Exact solution for different k values – Greedy, LP-LF, LP+LF: Prospectors

Experimental Evaluation • Contention Zones – Demonstrate local filtering – Network areas with negative correlation – Each zone has k low-mean, high-variance nodes such that each provides 1/6 of top k – LP+LF visits more nodes for the same energy budget

Experimental Evaluation • Temperature readings from the Intel Lab Berkeley data set • Local filtering doesn’t help because there is no negative correlation

Experimental Evaluation • Given enough energy for Prospector Proof (phase 1), the total cost of Prospector Exact is less than naïve-k

Conclusions • Approximation is an important technique for limiting network transmissions and saving energy • Model-driven acquisition is good for building query plans, but not straightforward for complex queries • Top-k is a fundamental query and illustrates the complications of using models • We use sampling-based models and associated linear programs to build query plans, incorporating sophisticated features • Experimentally – Approximate solutions are much less expensive than exact ones – The more sophisticated the planning program, the more the plan can leverage the network and data characteristics to visit more nodes for the same amount of energy

Experimental Details • 200 nodes, k=40 • 200 x 400 area, 50 m range • Berkeley: k=12, of 54 nodes • Naïve top-k on Berkeley: 907 mJ for 100%, 623 mJ for 20%, so 3x more expensive than approx at near 100% • Fixed Cost: 40, Marginal Cost: 18