Design Lessons from an Unattended Ground Sensor System

- Slides: 39
Design Lessons from an Unattended Ground Sensor System*

Lewis Girod†
†Center for Embedded Networked Sensing, girod@cs.ucla.edu
CS 294-1, 23 Sept 2003

*Work done at Sensoria Corp, supported by DARPA/ATO contract DAAE30-00-C-1055. Sensoria team: R. Costanza, J. Elson, L. Girod, W. Kaiser, D. McIntire, W. Merrill, F. Newberg, G. Rava, B. Schiffer, K. Sohrabi
Introduction

- Networked embedded systems come in a variety of sizes
- RAM is a primary tradeoff
  - Costs power, cost to shut down
  - Enables greater complexity
  - Enables development while postponing optimization
- Example platforms (CPU, RAM/flash):
  - SPEC (J. Hill): 4 MHz/8-bit, 3K/0K
  - Mica2 (Berkeley/Xbow): 8 MHz/8-bit, 4K/128K
  - MK2 (UCLA/NESL): 40 MHz/16-bit, 136K/1M
  - SHM (Sensoria): 300 MIPS+FP/32-bit, 64M/32M
Motivation for EmStar

- EmStar is a runtime environment designed for Linux-based distributed embedded systems
  - Useful facilities (process health/respawn, logging, emulation)
  - Common APIs (neighbor discovery, link interface, etc.)
  - Designed for a larger memory footprint (avoids hard limits)
- Many of the ideas and motivations for EmStar derived from our experience with SHM
  - Modularity, robustness to module failure
  - System transparency at low cost to developers
- Some parts of EmStar are used in SHM & elsewhere
  - Time Sync service
  - Audio Server
System Objectives and Design
Objectives

- Unattended Ground Sensor (UGS) system*
- Fully autonomous operation
- Ad-hoc deployment
- Scaling unit: 50 nodes
  - All operations and protocols local to a 50-node region
  - No global operations or context required

*Fancy graphics taken from the official SHM website
Adaptive and Self-Configuring

- Self-localizing without GPS*: acoustic ranging
  - Build a map of relative locations
  - Adaptive/resilient to environmental conditions, e.g. wind, sunny days, background noise
- Self-assembled data network
  - TDMA MAC layer
  - Typically 10-hop diameter with 100 nodes
  - Adaptive/resilient to the RF environment

*In this application, GPS is avoided for security reasons. In other applications, obstructions and foliage can be an issue.
Maintain Coverage via Actuation

- Vigilant units detect failed unit(s)
- Remaining units autonomously move in to maintain coverage
Demonstration Requirements

- 200 x 50 m outdoor field
- 100 nodes, 10 m spacing
- Sunny afternoon: 85°F, 20 MPH wind
- No preconfigured state
- GPS-free relative geolocation to 0.25 m
- Detect downed nodes, move to maintain coverage within 1 min
SHM Project Design Choices: Optimize for Rapid Development

- Concurrent HW/SW development
- Compressed schedule
  - Aggressive scaling milestones
  - Logistical problems with debugging a system of 100 nodes in a 200 x 50 m field
- Complex software required
  - 150K lines of C code
  - 30 processes
  - 100 IPC channels
- Power is not the driving constraint
  - Continuous vigilance and rapid response are project requirements
  - System lifetime target: < 1 day
System Configuration

- 300 MIPS RISC processor with FPU
- 64M RAM / 32M flash
- 2 x 50 kbps 2.4 GHz data radios: TDMA, frequency-hopping, star-topology MAC, 63 hopping patterns
- 4 channels of full-duplex audio
- 3-axis magnetometer / accelerometer
- 2 mobility units, with integrated thrusters
- Linux 2.4 kernel
- Optional wired Ethernet (for devel/debug only)
Results: Acoustic Ranging

- Ground truth was hand-surveyed, +/- 0.5 m
- Ranges were not temperature-compensated in the demo
- Ranges with angles are more accurate
  - Angle from TDOA of two or more ranges; must be consistent
  - A bug discovered after the fact caused large errors
Results: Radio Utilization

- Graph shows traffic at three bases over a complete run
- Initial spikes:
  - Tree formation
  - Lots of ranging
- Quiescent rates:
  - Heartbeats to detect down nodes
  - Maintenance of trees and location
  - Reaction to dynamics
Challenges in Implementation

- Dealing with a dynamic environment
  - Adapt to wind, weather, RF connectivity
- Dealing with noise
  - Rejecting outliers from timesync, ranges, angles
  - Filtering neighbor connectivity, insignificant changes to range/angle
- Dealing with failure
  - Node failure
  - Infrequent crashes (e.g. FP exceptions from transient bad data)
  - Fault tolerance at process boundaries, avoiding ripple effects
- Dealing with complexity
  - Cross-layer integration vs. modularity... or both?
  - What is the right set of primitives for coordination?
Case Study: Acoustic Ranging
Basic TOF Ranging

- Basic idea:
  - Sender emits a characteristic acoustic signal
  - Receiver correlates the received time series against time-offsets of the reference signal to find the "peak" offset

(Figure: degree of correlation as a function of time offset T)
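The correlation step can be sketched in a few lines of Python. The chirp waveform, sample rate, and noise level below are illustrative assumptions, not the actual SHM signal design:

```python
import numpy as np

SAMPLE_RATE = 48000      # Hz; assumed codec rate, not from the deck
SPEED_OF_SOUND = 343.0   # m/s (no temperature compensation, as in the demo)

def estimate_range(reference, received, sample_rate=SAMPLE_RATE):
    """Slide the reference over the received time series, take the offset
    with the highest correlation, and convert that delay to a distance."""
    corr = np.correlate(received, reference, mode="valid")
    delay_samples = int(np.argmax(corr))          # the "peak" offset
    return delay_samples / sample_rate * SPEED_OF_SOUND

# Synthetic test: a noise-like chirp buried 0.1 s into the recording.
rng = np.random.default_rng(0)
chirp = rng.standard_normal(1024)
true_delay = int(0.1 * SAMPLE_RATE)
received = np.concatenate([np.zeros(true_delay), chirp, np.zeros(1000)])
received += 0.1 * rng.standard_normal(received.size)   # background noise

print(round(estimate_range(chirp, received), 2))   # ~34.3 m: 0.1 s of travel
```

In the deployed system the delay would also be referenced against the sender's reported codec start time; the sketch assumes the recording begins exactly when the chirp is emitted.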
Basic AOA Estimation

- 16 possible paths
- First pick the best speaker
- Then estimate the angle from the TDOA of one or more consistent ranges
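The deck doesn't spell out the angle computation; a common far-field sketch, with a made-up sensor baseline and TDOA value, looks like this:

```python
import math

def angle_from_tdoa(delta_t, baseline_m, speed_of_sound=343.0):
    """Far-field approximation: a distant source at bearing theta from the
    broadside of a two-sensor baseline satisfies sin(theta) = c*dt/d."""
    x = speed_of_sound * delta_t / baseline_m
    x = max(-1.0, min(1.0, x))   # clamp: noisy TDOAs can exceed the physical range
    return math.degrees(math.asin(x))

# Example: 0.2 m baseline, ~0.29 ms arrival difference -> about 30 degrees.
print(round(angle_from_tdoa(0.000292, 0.2), 1))
```

The "must be consistent" check on the slide corresponds to rejecting TDOAs that fall outside the physically possible range for the baseline, which is what the clamp guards against.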
Acoustic Ranging, Version 1

- First cut implemented an explicit cluster coordination protocol
  - Lots of error cases to handle, hard to handle all of them efficiently
  - Very timing-sensitive (sync)
- Did not scale past 20 nodes
  - Can't range across clusters
  - Best acoustic neighbors may be in other clusters
  - MLat merging algorithm is error-prone
- Overuse of flooding
  - Soft-state reflood of cluster MLats and orientation data

(Diagram: V1 module stack — Field Table, Flooding, Merging, Orientation, Cluster MLat (CH only), AR, Neighbor Discovery, Radio 0/Radio 1, with mlat and orientation messages exchanged over flooding and an acoustic signal input to AR)
V2: Decomposing AR

(Diagram: AR layered over reusable services — Reliable State Sync, Flooding with Hop-by-Hop Time Conversion, the Timesync Service, and the Audio Server)
Audio Sample Server

- Continuously samples audio, integrates with Timesync
  - Eliminates error-prone "synchronized start"
  - Enables acquisition of overlapped sample sets
- Buffers the past N seconds, exposes a buffered interface
  - Data access can be triggered after the fact: relaxes timing constraints on the trigger message
  - Can process overlapping chirps by requesting overlapping retrievals, rather than having to pick one and ignore the other
- Enables access to the audio device from multiple apps
  - Ranging can coexist with the acoustic comm subsystem*

*Acoustic comm was developed as a backup channel to be used in the event of RF jamming
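A minimal sketch of the buffered-interface idea, with a hypothetical API (the class and method names are made up): past samples stay addressable by sequence number, so overlapping windows can be fetched after the fact.

```python
import numpy as np

class AudioSampleServer:
    """Toy ring buffer standing in for the audio server: samples are appended
    continuously, and clients may retrieve any window from the recent past,
    including overlapping windows for overlapping chirps."""

    def __init__(self, capacity):
        self.capacity = capacity        # keep the last `capacity` samples
        self.buf = np.zeros(capacity)
        self.next_seq = 0               # sequence number of the next sample

    def append(self, samples):
        for s in samples:               # simple ring-buffer write
            self.buf[self.next_seq % self.capacity] = s
            self.next_seq += 1

    def retrieve(self, start_seq, length):
        """Return samples [start_seq, start_seq+length); raise if the window
        was already overwritten or is still in the future."""
        if start_seq < self.next_seq - self.capacity or start_seq + length > self.next_seq:
            raise ValueError("window not buffered")
        idx = np.arange(start_seq, start_seq + length) % self.capacity
        return self.buf[idx].copy()

server = AudioSampleServer(capacity=8)
server.append(range(12))        # samples 0..11; 0..3 have been overwritten
print(server.retrieve(6, 4))    # overlapping, after-the-fact windows:
print(server.retrieve(8, 4))    # both [6..9] and [8..11] are still available
```

The real server additionally tags the buffer with timesync information (DMA interrupt time vs. sample number), which this sketch omits.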
Inter-node Timesync: RBS

- Key idea:
  - Receiver latency is more deterministic than sender latency
  - Thus, common receivers of a sender can be synched by correlating the reception times of the sender's broadcasts
  - It's your only option if you don't control the MAC

(Figure: senders* whose broadcast domains sync the green nodes, the purple nodes, and the P&G nodes)

*For sender sync, senders must be in some other sender's broadcast domain
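The receiver-receiver idea can be illustrated with synthetic numbers (the timestamps below are made up, and skew estimation is omitted for brevity):

```python
# Local reception timestamps at two receivers for the same four broadcasts.
recv_a = [10.002, 20.004, 30.001, 40.003]   # node A's clock
recv_b = [12.503, 22.505, 32.502, 42.504]   # node B's clock

# Each pair differs by (offset + receiver jitter); averaging the differences
# estimates the pairwise clock offset without any sender timestamps at all.
diffs = [b - a for a, b in zip(recv_a, recv_b)]
offset_ab = sum(diffs) / len(diffs)         # clock_b ≈ clock_a + offset_ab
print(round(offset_ab, 3))                  # ~2.501 s
```

Because only reception times enter the estimate, the (large, variable) sender-side latency cancels out, which is the whole point of RBS.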
TimeSync Service

- Inter-node sync: implementation of "RBS"
  - Computes conversion parameters among all nodes in each cluster
  - CH does the computation, reports parameters to CMs
- Intra-node sync: codec sample clocks
  - Clock pairs reported by the audio server
    - Map the time of the DMA interrupt to a sample number
  - Outlier rejection and linear fit to find offset and skew estimates
    - Yields a more consistent result than "synch start"
- Multihop time conversion
  - Graph of sync relations through the system
  - Conversion from one element to another requires a path through the graph; Gaussian error at each step grows ~sqrt(hops)
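The outlier-rejection-plus-linear-fit step might look like the following sketch. The rejection rule (drop points whose residual exceeds 3 standard deviations, then refit) is an assumption; the deck doesn't specify one:

```python
import numpy as np

def fit_clock(local_times, remote_times, n_iter=3, sigma_thresh=3.0):
    """Estimate skew and offset so that remote ≈ skew*local + offset,
    with simple iterative outlier rejection."""
    local = np.asarray(local_times, float)
    remote = np.asarray(remote_times, float)
    keep = np.ones(local.size, bool)
    for _ in range(n_iter):
        skew, offset = np.polyfit(local[keep], remote[keep], 1)
        resid = remote - (skew * local + offset)
        std = resid[keep].std()
        if std == 0:
            break                      # perfect fit on the kept points
        keep = np.abs(resid) < sigma_thresh * std
    return skew, offset

# Synthetic clock pairs: skew 1.0001, offset 5 s, one gross outlier
# (e.g. a delayed timestamp report).
local = np.arange(0.0, 100.0, 5.0)
remote = 1.0001 * local + 5.0
remote[4] += 2.0
skew, offset = fit_clock(local, remote)
print(round(skew, 4), round(offset, 2))
```

A single bad clock pair would badly corrupt a plain least-squares fit; rejecting it first recovers the underlying offset and skew, which is what makes this more consistent than a one-shot "synch start".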
Hop-by-hop Time Conversion

- Problem:
  - Nodes can convert times within their cluster but not outside it
  - Could continually broadcast conversion parameters... BUT
    - They are continuously varying
    - Large amount of data to transmit across the network
- Solution: integrate time conversion with routing
  - The routing layer knows about packets that contain timestamps*
  - Convert timestamps en route
    - At cluster boundaries
    - At the destination node
  - Integrated with flooding
  - Can fail if the sync graph and the route diverge

(Diagram: a timestamp rewritten hop by hop along the path A-B-C-D-E — "Timebase: A, Time: 144" leaving A, "Timebase: C, Time: 732" after the B-C hop, "Timebase: E, Time: 234" after the D-E hop)

*Unclear what the right API is here: we simply added code to flooding.
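In sketch form, each hop that owns a sync relation rewrites the packet's timestamp into its own timebase. The conversion table and its parameter values here are invented for illustration:

```python
# Locally known conversion parameters: (from_timebase, to_timebase) maps to
# (skew, offset), meaning t_to = skew * t_from + offset. Made-up values.
CONVERSIONS = {
    ("A", "C"): (1.0002, 12.5),
    ("C", "E"): (0.9998, -3.0),
}

def convert(packet, to_timebase):
    """Rewrite a packet's timestamp into `to_timebase` if a local parameter
    pair exists; raise otherwise (the sync-graph-vs-route failure case)."""
    key = (packet["timebase"], to_timebase)
    if key not in CONVERSIONS:
        raise KeyError("no sync relation on this hop")
    skew, offset = CONVERSIONS[key]
    return {"timebase": to_timebase, "time": skew * packet["time"] + offset}

pkt = {"timebase": "A", "time": 144.0}
pkt = convert(pkt, "C")    # rewritten at the A/C cluster boundary
pkt = convert(pkt, "E")    # rewritten again at the C/E boundary
print(pkt)
```

Only the per-hop parameters need to exist; no node ever has to learn the continuously varying parameters of distant clusters, which is the saving the slide describes.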
Reliable State Synchronization

- Problem:
  - Need to reliably broadcast the latest range data to nodes N hops away, so they can build a consistent coordinate system
  - Should have reasonable latency and low overhead
- V1 addressed this problem with periodic refresh
  - Cluster heads retransmit MLat tables every 15 seconds
    - Problems: traffic load from redundant sends, latency on message loss
  - Traffic load forced a new protocol:
    - Send a hash when there was no change since the last refresh
    - If the hash has not been seen, request the full version
    - But this still has 15-second latency on lost data
- V2 introduced a "Reliable State Sync" protocol (RSS)
RSS Design

- Semantics:
  - Reliably converges on the latest published state
  - Does not guarantee that a client sees every transition
- Robust and efficient, structurally similar to SRM/wb:
  - Based on reliable transfer of a sequenced log of "diffs"
    - Pruning of the log is done with awareness of log semantics (replaced or deleted keys are pruned)
  - Per-source forwarding trees (MST of the connectivity graph)
  - Local repair, up to complete history, from the upstream neighbor
    - New or restarted nodes download all active flows from the upstream neighbor
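A toy version of the sequenced diff log with semantic pruning (the API names are hypothetical, and garbage collection of delete records is omitted):

```python
class StateSyncPublisher:
    """Sketch of an RSS-style publisher: state is a set of key-value pairs,
    changes go into a sequenced diff log, and the log is pruned with
    awareness of semantics — a diff whose key has since been replaced
    carries no information and can be dropped."""

    def __init__(self):
        self.seq = 0
        self.log = []          # list of (seq, key, value); value None = delete
        self.state = {}

    def publish(self, key, value):
        self.seq += 1
        self.log.append((self.seq, key, value))
        if value is None:
            self.state.pop(key, None)
        else:
            self.state[key] = value
        self._prune()

    def _prune(self):
        # Keep only the newest diff per key; drop superseded entries.
        latest = {}
        for seq, key, value in self.log:
            latest[key] = (seq, key, value)
        self.log = sorted(latest.values())

    def diffs_since(self, seq):
        """What a neighbor needs to catch up from sequence `seq` (a new or
        restarted node asks with seq=0 and gets the full active state)."""
        return [d for d in self.log if d[0] > seq]

pub = StateSyncPublisher()
pub.publish("A", 1)
pub.publish("B", 2)
pub.publish("A", 3)        # supersedes the first A diff
print(pub.diffs_since(0))  # -> [(2, 'B', 2), (3, 'A', 3)]
```

Note that a catch-up never replays the stale `A=1` diff: converging on the latest state, rather than replaying every transition, is exactly the weakened guarantee the slide names.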
RSS API

- Node X publishes its current state as key-value pairs
- Diffs are reliably broadcast N hops away from X
- Each node within N hops of X eventually sees the data X published

(Figure: four nodes 1-2-3-4 on a State Sync "bus"; node 1 publishes A=1 and B=2, node 2 publishes A=3 and C=4; note that a 2-hop publish from node 1 doesn't reach node 4)

- The API presents each node's KVPs in its own namespace
- Caveat: transmission latency, loss, and the edge of the hopcount can cause transient inconsistencies
Putting It Back Together: AR V2

(Diagram: AR V2 dataflow — software modules audiod, syncd, ar_send, ar_recv, mlatd, and orientd, wired together through device files such as audio/[02]/sync_bin, ar/request_chirp, ar/mlat, types/ranges, types/orient, orient/status, orient/pub_seq, and orient_bcast; chirp notifications travel over flooding with hop-by-hop time conversion, range and orientation state over the Reliable State Sync "bus", and the acoustic signal links ar_send to remote audiod instances)
AR V1 Event Diagram

(Diagram: message sequence among mlatd and AR on the cluster head (coordinator), AR on the sending cluster member, and AR on the receiving cluster members)

- If there is not enough data for the cluster mlat, request ranging to specific missing cluster members
- Send a Range REQ to the first CM in round-robin order, checking busy; the CM ACKs with a preferred start time
- Broadcast Range Start, specifying the code
- Receivers timestamp message arrival; the sender delays before starting to ensure rough sync, and reports the exact time offset from broadcast to codec start
- Receivers run the correlation and report the time offset from broadcast to detection in the data
- The CH waits for stragglers
- Report new ranges and notify mlatd when the round robin through the CMs completes

But there are many error cases:
- REQ lost? ACK lost?
- Broadcast lost to some receivers? Broadcast delayed in a queue? Broadcast lost to the sender?
- CMs join two clusters; a CM may be busy ranging in the other cluster
- Inaccurate codec sync start?
- Interference in the acoustic channel?
- Reply from the sender lost? Reply from receiver(s) lost?
- How long to wait for stragglers?
- CH failure loses all ranges for the cluster

Reliability challenges: the sender is the linchpin. An error in sender sync affects all ranges to receivers, and replies from receivers can't be interpreted without the sender's reply. If connectivity to the sender is bad and the broadcast is lost, all receivers waste CPU on a useless correlation. Implementing reliable reporting is made more complex because retransmitted receiver replies must be matched to a past sender.

Big complexity increase to:
- Range across clusters
- Coordinate adjacent clusters
- Do regional mlats
- Average multiple sync broadcasts
AR V2 Event Diagram

(Diagram: mlatd, ar_send, audiod, syncd, and ar_recv, with continuous sampling and continuous sync maintenance; states include "waiting for enough data to compute mlat", "waiting for chirp request", and "waiting for chirp notification")

- If there is not enough data for the mlat, request a chirp and wait for a while
- Chirp audio (audiod on the remote node records it)
- Flood the chirp notification message, with hop-by-hop conversion at the flood layer
- Retrieve samples from the buffer and correlate in a separate thread
- Publish the new range to N-hop-away neighbors
- Try the mlat again with the new data in a separate thread

What can go wrong here?
- Collisions in the acoustic channel
- Flooded message delayed beyond the audio buffering (16 seconds)
- Flooded message dropped for lack of sync relations along the route
- Node restart causes ranges to/from that node to be dropped

Key design points:
- Encapsulate the timing-critical parts; no timing constraints on reliability
- If a receiver can't sync to the sender, it won't attempt the correlation
Key Observations

- No coordination required
- Simplifying transport abstractions
- Continuous operation and service model
Key Observations

- No coordination required
  - If mlatd doesn't have enough data, it triggers chirping to start generating more
    - Exponential backoff on chirping, with reset when data is lost
    - The simplicity of the system lets the designer focus on these details
  - ar_send & ar_recv are slaves to request and notify messages
    - Transparently, ar_recv can receive overlapping triggers and buffer the data for correlation
    - A priority scheme decides the best order to process queued correlations, based on past success/failure and RF hopcount
- Simplifying transport abstractions
- Continuous operation and service model
Key Observations

- No coordination required
- Simplifying transport abstractions
  - Flooding takes care of delivering a local time
  - State Sync provides consistency for the data input to mlatd
    - Efficiently supports a potentially large number of keys (~1000), enabling a full regional mlat at each node (no "merging")
    - A mlat takes 10-15 min; sync is consistent on that timescale
    - Failure of one node only loses the range data for that node
- Continuous operation and service model
Key Observations

- No coordination required
- Simplifying transport abstractions
- Continuous operation and service model
  - Eliminates many inconsistencies and corner cases
  - Reduces the number of states or modes
    - Simplifies interfaces to services
  - Recovery from faults without coordination: just wait for stuff to start working again
  - Service model supports multiple apps concurrently
The Catch

- Of course, the catch is power consumption
  - Continuous operation can be wasteful
  - Modularity can be less efficient than cross-layer integration
- Interesting question:
  - How much is gained by fine-grained shutdown, with its added coordination overhead, relative to coarser-grained shutdown and periods of continuous operation?
    - For instance, the AR system could shut down after generating an initial map, and only wake up when something moves.
The End!

For more information on EmStar, see http://cvs.cens.ucla.edu/emstar/
Design Evolution

- Initial design strategy: shortest path first
  - Modular decomposition according to the best guess at the time
  - Making a full-blown, generalized service is much more work than a one-off feature, so the tradeoff was considered case by case
  - Problem: as more is learned, these tradeoffs fit more poorly
    - Unmanageable complexity to address problems
- Redesign:
  - Factor out common components
  - Plan for known scaling problems
  - Remaining modules are of manageable complexity, yet usually achieve a more complete and correct implementation
  - More sophisticated inter-module dependencies
Radios

- Each node has two radios
- TDMA, frequency-hopping radios
- 63 hopping patterns
  - Each radio can lock to one pattern
  - Patterns are independent "channels"
  - Bases on the same pattern tend to be desynchronized
- Base/Remote (star) topology
  - The base synchronizes the TDMA cycle; remotes join
TDMA Slot Scheme

- Each frame contains 1 transmit slot for the base and 1 transmit slot for each remote
  - Slot size implies MTU
- Frame size is a constant
  - Base slot size is fixed: 70-byte MTU
  - Number of remotes is inversely proportional to the remote MTU
    - Practical MTU (40 bytes) → 8-node clusters

(Figure: a 14 ms frame — sync (S), the base slot, and slots for Remote 1, Remote 2, Remote 3)
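The inverse relation between remote MTU and cluster size can be sketched numerically. The total per-frame byte budget below is a made-up figure chosen to reproduce the 8-node practical case; the deck doesn't state it:

```python
BASE_MTU = 70          # bytes; the fixed base slot (from the deck)

def remotes_per_frame(remote_mtu, frame_budget=390):
    """With a constant frame, whatever capacity is left after the fixed base
    slot is divided into one slot per remote (hypothetical frame_budget;
    per-slot sync/guard overhead is ignored)."""
    return (frame_budget - BASE_MTU) // remote_mtu

print(remotes_per_frame(40))   # practical 40-byte MTU -> 8-node clusters
```

Halving the remote MTU roughly doubles the cluster size, and vice versa, which is the tradeoff the slide describes.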
Packet Transfer

- Broadcast capability
  - The base can use its slot to send a broadcast to all remotes, or a unicast to a single remote
  - Remotes can send only unicasts to the base
- Link-layer retransmission
  - The MAC implements link-layer ACKs for unicast messages, and configurable retransmission
Breach Healing

In this application, "healing" is intended only to address breaches created by dying nodes, not preexisting breaches. Other algorithms might also be useful, e.g. density maintenance, but were not implemented here.