Department of Computer Science
A Key-Value-based Persistence Model for Sensor Networks
By: Marcello Alves de Sales Junior
Master of Science in Computer Science
Advisor: Prof. Arno Puder, Ph.D.
Committee Chair: Prof. Marguerite Murphy, Ph.D.

Outline
1. Motivation and Literature Review
2. A Taxonomy for Data Persistence
3. NetBEAMS: A Case Study
4. Empirical Analysis for Technology Selection
5. DSP Data Persistence: Design and Architecture
6. Experimental Results: Correct Behavior and Performance
7. Conclusions and Future Works

1. Motivation
• Persistence for NetBEAMS Sensor Network Infrastructure
  • Component-based sensor network for environmental monitoring (Puder et al.)
  • Biologists are the main users of the system
• What types of database systems?
  • Use the traditional relational data model?
  • Bai et al. propose domain-specific programming languages

1. State of the Art in Data Persistence
• Sensor Networks [Akyildiz et al.]
  • Infrastructure (Topology) and Node Types
  • Size and Lifetime [wp]
• Sensor Network Nodes [Romer et al.]
  • Deployment and Mobility
  • Size, Resources, Cost, Energy, Heterogeneity
  • Communication Mode
  • Coverage and Connectivity [snc]

• Persistence Storage for Sensor Networks
• How the Collected Data is Used
  • Real-time Data Stream
  • Data Archival
• Storage Location for Collected Data
  • Local or External
  • Data-Centric
• Query Processing Used
  • In-Network
  • Centralized
• Data Volume Produced [ns]

• Data Models and Query Engines
  • Tabular Data Model: flat files
    • Comma-separated values
    • Text Comparison
    • No Index
  • Relational Data Model: binary files
    • Data Normalization
    • Structured Query Language (SQL)
    • Lee et al. proposed new operators for SQL
  • Structured Data Model: structured XML documents
    • XML Schema: document structure
    • XML XPath: data retrieval
[Figure: Data Sink; Identify, Describe, Classify; Storage-Ready Database System]

• How Is Collected Data Described?
  • Data Stream: 334 55.45 -23.44 119.394 44 1 22 | 5/
  • Ledlie et al. propose the use of Data Provenance
  • Metadata: data about data
• What was collected?
  • Temperature = 54.3 : data
  • Scale = 'fahrenheit' : metadata
• When was the data collected?
  • Valid Time = Collected at 10:34 am
  • Transaction Time = Time of Arrival
• From where was the data collected?
  • GPS Coordinates: (12.342, -145.304)
  • Site: 'lower-pier' : metadata

• Problem: recent oil spill in the San Francisco Bay (Oct 2009) [sfb09]
  • Correlations between the collected data and the oil spill
  • Describing historical data events
  • “Junk” Data
• Data Annotation
  • Liu et al. annotate video frames from sensor cameras
• Descriptive Metadata
  • YouTube Video Tag
  • Tags for Web 2.0 [an]

2. Data Persistence in Sensor Networks: a Proposed Taxonomy

2. A Taxonomy for Data Persistence
• Taxonomy: from Greek τάξις, taxis (meaning 'order', 'arrangement') and νόμος, nomos
• Practice and science of classification
• Represented by hierarchical diagrams
• Relationships between the root and branches
[Figure: hierarchy of a Type with Taxa: Taxon 1, Taxon 2, Taxon 3]

Data Persistence Taxonomy
• Purpose of Collected Sensor Data
  • Real-Time Stream Use
  • Data Archival
• Location of the Sensor Data
  • Local Storage
  • External Storage
  • Data-Centric Storage
• Database System Organization
  • Distributed System
  • Centralized System
  • Replication
  • Database Partition
• Query Processing Mechanism
  • In-Network Query Processing
  • Centralized Query Processing
• Data Volume
  • Small
  • Medium
  • Large
• Data Model
  • Schema-Dependent
  • Schema-less
  • Relational Model
  • Structured Model
  • Tabular Model
• Data Description
  • Annotations
  • Data Provenance
    • What: Data Identity
    • When: Time Dimension
    • Where: Data Location

3. NetBEAMS: A Case Study

3. NetBEAMS: A Case Study
• NetBEAMS: Data collection using Data Sensor Platform (DSP)
• Automates operation of SF-BEAMS
• SF-BEAMS: single-star sensor network – data archival
  • Nodes geographically fixed
  • Single-hop communication
  • Production intervals: 1, 6 or 15 minutes
  • Heterogeneous Devices
  • Coverage: Tiburon coast
  • 1 Data Sink (RTC Labs)

3. NetBEAMS: A Case Study
• NetBEAMS Gateway Node
  • YSI Sonde + Gumstix Embedded System + GSM Modem
[Figure: Centralized Data Sink, RTC Labs]

• Device Used by NetBEAMS
  • YSI 6600 EDS V2: COTS Water Quality Monitoring
  • 13 Measurement parameters
  • 1 Year worth of raw data
    • Max 23.99 Mb at 1/min
    • 483,840 samples per year
  • 5 YSI in current deployment [ysi]

SF-BEAMS Classification
[Figure: the Data Persistence Taxonomy (Purpose of Collected Sensor Data, Location of the Sensor Data, Database System Organization, Data Model, Data Description, Query Processing Mechanism, Data Volume) marked with the SF-BEAMS classification]

• NetBEAMS Data Collection Scenario: Missing Component!
[Figure: Collected Data (12.20 192 179 55 88.40 0.09 0.084 0.059 7.98 -79.6 99.5 8.83 0.4 8.7) carried as DSP Messages]

Functional Requirements

Non-Functional Requirements
• Open-Source
• Free of charge
• Easy to Scale (Data Partitioning)
• Accessibility (API)
• Cope with RTC's Small Volume of Data

4. Technology Selection Empirical Analysis

4. Empirical Analysis for Technology Selection
• Technologies used in the literature reviewed
  • MySQL: Jacob Nikom used it in a Linux cluster for sensor networks;
  • TinyDB: Madden et al. and Lee et al. used it for sensor networks;
  • DB2: Sow et al. used it as a hybrid approach of XML and relational models to persist and query biometric events;
  • mongoDB: Buyya et al. reported it among new trends in persistence in the cloud.

• Use traditional Relational Databases
  • Tony Bain questions the adoption of the Relational Model
  • Traditional approach: 30 years
• Accommodates changes?
  • Try adding entities
  • Try adding properties
  • Changes to the schema
  • Maintain schema normalized
  • Change Software Layers

• Schema-less: Key-Value Pair Data Model
• Data Collections: “denormalized” data
• No Data Integrity = Data located on same physical space
[Figure: a document holding Observation, Provenance, and Annotation]

• KVP Databases: better support for horizontal data partitioning
• Shenker et al. survey Data-Centric Storage
• Targeted Query vs Global Query
[Figure: Collected Sensor Data partitioned by region (Tiburon, CA; Berkeley, CA; South Bay, CA) into shards labeled Region 1 – Master Shard, Region 1 – Shard 2, Region 2 – Master Shard, Region 3 – Shard 2, with Projection and Count operations]
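The difference between a targeted and a global query can be sketched in plain JavaScript. This is only an in-memory illustration, not a mongoDB API: the `shards` map, the `site` field, the pH values, and both function names are assumptions made for the example. The region names come from the figure.

```javascript
// Hypothetical in-memory shards keyed by region; values are made-up samples.
const shards = {
  "Tiburon, CA": [{ site: "Tiburon, CA", observation: { pH: 7.11 } }],
  "Berkeley, CA": [{ site: "Berkeley, CA", observation: { pH: 7.9 } }],
  "South Bay, CA": [
    { site: "South Bay, CA", observation: { pH: 8.02 } },
    { site: "South Bay, CA", observation: { pH: 7.95 } },
  ],
};

// Targeted query: the partition key (region) is known, so one shard is read.
function targetedCount(region) {
  const shard = shards[region] || [];
  return shard.length;
}

// Global query: every shard is consulted and the partial counts are summed.
function globalCount() {
  return Object.values(shards).reduce((total, shard) => total + shard.length, 0);
}

console.log(targetedCount("South Bay, CA")); // 2
console.log(globalCount()); // 4
```

A targeted count touches a single partition, while the global count grows with the number of shards, which is the trade-off the slide points at.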

Selection Criteria

5. DSP Data Sensor Platform: Design and Architecture

5. DSP Data Persistence: Design and Architecture
• Persistence Scenario for NetBEAMS – Solution

• Data Model Design: mongoDB Document Instance
• Data Manipulation: Programming Language Abstraction
• “Dot Notation”
  • Where: sensor.location.latitude = 37.89
  • When: time.transaction = Dec 17, 2009
  • What: observation.pH = 7.11
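As a rough illustration of how dot notation addresses values inside a nested document, the sketch below resolves a dotted path against a plain JavaScript object. The field names and values follow the slide; `getByPath` is a hypothetical helper written for this example and is not part of mongoDB.

```javascript
// Illustrative document; field names follow the slide, values are examples.
const doc = {
  sensor: { location: { latitude: 37.89 } },
  time: { transaction: new Date(2009, 11, 17) }, // JavaScript months are 0-based: 11 = December
  observation: { pH: 7.11 },
};

// Hypothetical helper: walk the object one key at a time along a dotted path.
function getByPath(obj, path) {
  return path.split(".").reduce(
    (node, key) => (node != null ? node[key] : undefined),
    obj
  );
}

console.log(getByPath(doc, "sensor.location.latitude")); // 37.89
console.log(getByPath(doc, "observation.pH")); // 7.11
```

A path that does not exist resolves to `undefined` instead of throwing, which suits documents whose attributes vary from sensor to sensor.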

[Figures: Adding DSP Data Component; Adding mongoDB]

• Deployment of the DSP Data Persistence
  • As External Storage: Single Server
  • As Data-Centric: Distributed Server

6. Experimental Results: Correct Behavior and Performance

6. Experimental Results: Correct Behavior and Performance
• Goal: Simulate RTC Environment
• Experiment Setup - Infrastructure
  • Key-Value definition
  • Randomly Generated YSI Sonde Data (R0)
  • Simulates Different Types of Storages using Virtualization
• Workload
  • Compatible with the data volume used by RTC
  • 1 YSI = 483,840 documents = First Round
  • 5 YSI = 5 * 483,840 = 2,419,200 = Consecutive Rounds

• Scenarios
  • Use Cases as Agile User Stories – Persona, Action, Result
  • (R1) “As a marine biologist, I would like to search observations by filtering values of the sensor device's properties such as water temperature and salinity on December 17, 2009, so that I can find associated values to the observation.”
• mongoDB Programming Language Abstraction to Access Data
db.SondeDataContainer.find( { "observation.Salinity" : 0.01, "observation.WaterTemperature" : 46.47, "time.valid" : new Date(2009, 11, 17) } )

• Scenarios
  • (U1) “As an estuarine ecologist, I would like to annotate observations from the time the “oil spill” occurred in the San Francisco Bay, so that I can maintain historical evidence of the impact of such event.”
• mongoDB Programming Language Abstraction to Access Data
db.SondeDataContainer.update( { "time.valid" : { $gte: new Date(2009, 10, 12), $lt: new Date(2009, 11, 13) } }, { $set : { tag: "oil spill" } } )

• Implementation of use cases: API diversity
Scenario Type | Language Used | mongoDB Method Call            | Use Case Implemented
Create        | Java          | db.SondeDataContainer.insert() | R0
Retrieve      | JavaScript    | db.SondeDataContainer.find()   | R1, R2, R3
Update        | JavaScript    | db.SondeDataContainer.update() | U1
Delete        | JavaScript    | db.SondeDataContainer.remove() | D1
Export        | Python        | db.SondeDataContainer.find()   | R4

• Implementation fulfills all the taxonomies
  • 1.35 GB Claimed Disk Space
  • ~25,091 Inserts/min
  • Retrieval ~milliseconds
  • Update Varies (Depends on Partition Size, Dataset)
• Simpler Implementation of Use Cases
  • Data accessibility
  • Different APIs, different languages
• Key-Value Data Model
  • No schema changes to modify data design
  • Trade-off between Disk Storage (commodity) and performance

• Data-Centric approach
  • Scales in terms of disk space available
  • Decreased processing time
  • Less data in a shard, faster query processing
• Novel approach: alternative to existing ones
  • New Data Model Taxonomy

7. Conclusions and Future Works

7. Conclusions and Future Works
• How Important is Data Collection
  • Environmental Sensor Networks: Hazard Alerts
  • How to describe data: Data Provenance guidelines
  • Important descriptions: annotations, tags
• Contributions
  • Data Persistence in Sensor Networks Taxonomies
  • Novel Approach: KVP data model for sensor networks data
  • Implementation for External and Data-Centric Storages
  • Technology ready for Cloud Computing

• Future Works
  • Data-Centric Deployment with MapReduce Application
  • Sorting, subsets

• Future Works
  • RTC gathers data by time period; data are mostly repeated
  • Wang et al. surveyed efficient schedulers for Sensor Networks;
  • Yin et al. and Chen et al. showed the use of Data Clustering before sending data to the data sink;
  • Creation of a DSP Data Clustering before persisting data;
• Research Problems
  • In-network storage/query using KVP databases
  • Partitioned Data nodes
  • Event-Based application developed on top of YSI Sonde Data
    • “observation.Battery” carries the battery life-time information;

References
• Arno Puder, Teresa Johnson, Kleber Sales, Marcello de Sales, and Dale Davidson. A component-based sensor network for environmental monitoring. In SNA-2009: 1st International Conference on Sensor Networks and Applications, pages 54–60, San Francisco, CA, USA, November 2009. The International Society for Computers and Their Applications - ISCA.
• I. F. Akyildiz, Weilian Su, Y. Sankarasubramaniam, and E. Cayirci. A survey on sensor networks. Communications Magazine, IEEE, 40(8):102–114, Aug 2002.
• K. Romer and F. Mattern. The design space of wireless sensor networks. IEEE Wireless Communications, 11(6):54–61, December 2004.
• Seungjae Lee, Changhwa Kim, and Sangkyung Kim. New database operators for sensor networks. In SERA '07: Proceedings of the 5th ACIS International Conference on Software Engineering Research, Management & Applications, pages 689–696, Washington, DC, USA, 2007. IEEE Computer Society.
• Jonathan Ledlie, Chaki Ng, and David A. Holland. Provenance-aware sensor data storage. In ICDEW '05: Proceedings of the 21st International Conference on Data Engineering Workshops, page 1189, Washington, DC, USA, 2005. IEEE Computer Society.
• Xiaotao Liu, Mark Corner, and Prashant Shenoy. Seva: Sensor-enhanced video annotation. ACM Trans. Multimedia Comput. Commun. Appl., 5(3):1–26, 2009.
• Jacob Nikom. Real-time sensor data warehouse architecture using mysql. In MySQL Users Conference. O'Reilly Media, Inc., April 2005.
• [sfb09] Oil spills into s.f. bay south of bay bridge. http://www.sfgate.com/cgibin/article.cgi?f=/c/a/2009/10/30/BA9B1ACTST.DTL, October 2009.

References
• Daby M. Sow, Lipyeow Lim, Min Wang, and Kyu Hyun Kim. Persisting and querying biometric event streams with hybrid relational-xml dbms. In DEBS '07: Proceedings of the 2007 inaugural international conference on Distributed event-based systems, pages 189–197, New York, NY, USA, 2007. ACM.
• Samuel R. Madden, Michael J. Franklin, Joseph M. Hellerstein, and Wei Hong. Tinydb: an acquisition query processing system for sensor networks. ACM Trans. Database Syst., 30(1):122–173, 2005.
Images
• [sd] http://www.zess.uni-siegen.de/ipp_home/ipp/research/master-student-topics/
• [snc] http://www.dei.unipd.it/~schenato/pics/SensorNetwork.jpg
• [ns] http://www.cc.gatech.edu/projects/disl/specialProjects/figure1.gif
• [an] http://eurekr.com/pics/AnnotatinganImageinWPF_A7D8/image.png
• [ysi] http://www.ckjorc.org/cn/admin/news/edit/UploadFile/200681616301130.jpg

Department of Computer Science
A Key-Value-based Persistence Model for Sensor Networks
Marcello de Sales
Master of Science in Computer Science (msales@sfsu.edu)
?
http://code.google.com/p/netbeams
http://www.netbeams.org
“The brick walls are not there to keep us out. The brick walls are there to give us a chance to show how badly we want something. Because the brick walls are there to stop the people who don't want it badly enough.” Dr. Randy Pausch

DSP in practice = NetBEAMS Use Cases
• Data Payload for the YSI Sonde 6600 V2
  • SondeDataType: representation for the collected data
  • SondeDataContainer: collection of the collected data

Data Sensor Platform (DSP) Message Structure
• DSP Message
  • Header
    • Producer
    • Consumer
  • Body
    • Message Content
• DSP Messages Container
  • Package of DSP Messages
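The outline above can be pictured as a small data structure. The sketch below is only a reading of the slide in JavaScript: the property names (`header`, `producer`, `consumer`, `body`, `content`, `messages`) and both factory functions are assumptions, and the component names in the usage lines are invented for illustration. The actual DSP classes are not shown in the deck.

```javascript
// Hypothetical factory; property names mirror the slide's Header/Body outline.
function createDspMessage(producer, consumer, content) {
  return {
    header: { producer, consumer },
    body: { content },
  };
}

// A messages container is modeled here as a plain package (array) of messages.
function createMessagesContainer(messages) {
  return { messages: [...messages] };
}

// Usage with invented component names and a made-up measurement.
const message = createDspMessage("sonde-producer", "persistence-consumer", {
  pH: 7.11,
});
const container = createMessagesContainer([message]);
console.log(container.messages.length); // 1
```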

Data Sensor Platform (DSP) Communication Mechanism
• DSP Broker
  • Local delivery
  • Remote delivery
  • Gateway Component
• DSP Matcher
  • Filtering based on rules
  • Independent Per Host
[Table: matcher rules with columns PRODUCER, CONSUMER, TARGET GATEWAY; entries: Component A, Component B, Component C, Component X, Component Y, Component Z, 192.168.0.11]

DSP Data Persistence Component


3. NetBEAMS: A Case Study
DSP Data Persistence Component Requirements
• Open-Source
• Support Data-Centric
• Free of charge
• Accessibility (API)
• Cope with RTC's Small Volume of Data

3. NetBEAMS: A Case Study
Missing Database System

• Data Sensor Platform (DSP): OSGi-based Components
  • Data Producers
  • Data Consumers

4. Technology Selection
Normalized Relational Model Example: State Instance

4. Technology Selection based on Empirical Analysis
• Optional Approach: Key-Value Pairs Approach = Sensor Attributes
  • Data Concentration in one Single Table
  • Does not Scale on Single Server
  • Hard to use

5. DSP Data Sensor Platform: Design and Architecture
• Data Model Based on Data Provenance Taxonomy
• What: identifies what was collected, tracks the DSP Message
  • message_id
  • observation.[raw-data-1, raw-data-2, …, raw-data-n], where each raw-data-i is one of the attributes of a given sensor
• Where: defines the coverage of the data
  • sensor
    • ip_address
    • location
      • latitude
      • longitude
• When: tracks when the data was collected and downloaded
  • time
    • transaction
    • fact

• DSP MongoCRUDService UML Class Diagram
  • CRUD Service
    • Create
    • Retrieve
    • Update
    • Delete
  • Dependencies on YSI Data Type

5. DSP Data Sensor Platform: Design and Architecture
DSP Data Persistence Component Activator UML Class Diagram
• Collected Messages delivered to the DSP Data Persistence
• Component Bootstrap
• Measurement Messages from sensor devices

5. DSP Data Sensor Platform: Design and Architecture
DSP Data Flusher UML Class Diagram
• Concurrent Thread
• Are there transient messages to be flushed?
• Collect them and send to the CRUD service
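The flushing logic described above could look roughly like the following. This is a minimal sketch in JavaScript, not the component's actual code: the class name `DataFlusher`, its methods, and the `fakeCrud` stand-in are all assumptions. The slide describes a concurrent thread, which is reduced here to a `flush()` method that a timer or scheduler would call periodically.

```javascript
// Hypothetical flusher: buffers transient messages and hands them to a
// CRUD-like service in one batch. All names here are assumed for illustration.
class DataFlusher {
  constructor(crudService) {
    this.crudService = crudService;
    this.transient = [];
  }

  // Called when a measurement message is delivered to the component.
  add(message) {
    this.transient.push(message);
  }

  // One flush cycle: if anything is pending, drain it and send it as a batch.
  flush() {
    if (this.transient.length === 0) return 0;
    const batch = this.transient.splice(0, this.transient.length);
    this.crudService.create(batch);
    return batch.length;
  }
}

// Stand-in for the persistence service; it only records what it receives.
const stored = [];
const fakeCrud = { create: (batch) => stored.push(...batch) };

const flusher = new DataFlusher(fakeCrud);
flusher.add({ observation: { pH: 7.11 } });
flusher.add({ observation: { pH: 7.12 } });
console.log(flusher.flush()); // 2
console.log(flusher.flush()); // 0
```

Draining the buffer before calling the service keeps messages that arrive during a flush in the next batch instead of losing them.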

6. Experimental Results: Correct Behavior and Performance
• (R3) “As a marine biologist, I would like to search observations that took place last week, so that I can assess past environmental conditions.”
db.SondeDataContainer.find( { "time.valid" : { $gte: new Date(2009, 11, 8), $lt: new Date(2009, 11, 15) } } )
• (R2) “As an oceanographer, I would like to search observations that took place at the geographic coordinates (37.891611, -122.446446), so that I can assess the area around the given coordinates.”
db.SondeDataContainer.find( { "sensor.location.latitude" : 37.891611, "sensor.location.longitude" : -122.446446 } )
• (U1) “As an estuarine ecologist, I would like to annotate observations from the time the “oil spill” occurred in the San Francisco Bay, so that I can maintain historical evidence of the impact of such event.”
db.SondeDataContainer.update( { "time.valid" : { "$gte": new Date(2009, 11, 12), "$lt": new Date(2009, 11, 13) } }, { $set : { tag: "oil spill" } } )
• (D1) “As an oceanographer, I would like to remove specific observations collected yesterday, so that the research group does not use 'junk data'.” (Delete)
• (R4) “As a biologist, I would like to export the collected data produced during this month using the OPeNDAP data format, so that I can collaborate with other research groups that use this data format.” (Export)

6. Experimental Results: Correct Behavior and Performance
• Execution through script plus a Java class for data insertion
• Measurements collected to log files with provenance information
  • Execution time, memory used
  • Execution steps
  • Database server and client execution
• Measured Results
  • Claimed Disk Space (Indexed, Long Keys): 1 YSI = 278.33 MB; 5 YSIs = 1.35 GB
  • Insertion Average: decreased after the execution of the third round from ~25,091 to ~8,013 documents per minute
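To put the two insertion averages in perspective, the short calculation below estimates how long the full five-sonde workload would take at each rate. It assumes a steady rate for the whole load, which the experiment did not have (the rate dropped after the third round), so the two figures are rough bounds and not measured results. The constant and function names are made up for the example.

```javascript
const DOCS_PER_YSI = 483840; // documents for one YSI sonde, per the workload
const YSI_COUNT = 5;         // sondes in the current deployment
const totalDocs = DOCS_PER_YSI * YSI_COUNT;

// Minutes needed to insert `docs` documents at a steady `docsPerMinute` rate.
function minutesToInsert(docs, docsPerMinute) {
  return docs / docsPerMinute;
}

const fastest = minutesToInsert(totalDocs, 25091); // early-round average
const slowest = minutesToInsert(totalDocs, 8013);  // average after the third round
console.log(totalDocs, fastest.toFixed(1), slowest.toFixed(1));
```

Under this assumption the load takes roughly an hour and a half at the early rate and about five hours at the later one.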
