Presented by: Omar Alqahtani Fall 2016
Authors:
Publication: VLDB 2016
Type: Demo Paper
- What is LocationSpark?
- Related Work
- Overview and features of LocationSpark
  - Data model and data type
  - Spatial Queries
- Query Scheduler
- Query Executor
  - Spatial Indexing
  - Memory Management
- Evaluation
LocationSpark is a spatial data processing system built on top of Apache Spark. It provides spatial query APIs on top of the standard dataflow operators.
Why not use Spark directly? Because plain Spark suffers from:
1. A lack of spatial indexing.
2. An inability to handle spatial data skew.
3. A lack of spatial query optimization.
4. Unnecessary network communication due to spatial data overlap.
To speed up spatial processing on Spark, the authors introduce a range of optimizations:
- Global and immutable local spatial indexes over in-memory data (e.g., Grid, R-tree, Quadtree, and IR-tree) to support efficient spatial queries.
- An automatic skew analyzer and handler.
- A Bloom filter to reduce the communication cost.
Two categories:
- Systems using Hadoop, such as [4, 13], Hadoop-GIS, SpatialHadoop, and MD-HBase. However, Hadoop MapReduce has to write intermediate data to HDFS.
- Systems using Apache Spark, such as GeoSpark, SpatialSpark, Magellan, and GeoTrellis. Most of them suffer from query skew and from excessive, unoptimized network and I/O communication.
Data model and data type:
- LocationSpark stores spatial data as key-value pairs.
- The key can be a two-dimensional point, a line segment, a polyline, a rectangle, or a polygon.
- The value type is specified by the user, for example text.
Spatial queries:
- It supports spatial range, spatial kNN, spatial-join, and kNN-join queries.
- It provides analysis functions, including spatial data clustering, spatial skyline computation, and spatio-textual topic summarization.
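To make the key-value model concrete, here is a minimal sketch in plain Python. It is not the LocationSpark API: the sample records, `range_query`, and `knn_query` are hypothetical names, keys are limited to points, and both queries scan all records without an index.

```python
import heapq
import math

# Hypothetical records: the key is a 2-D point (x, y), the value is user-defined text.
records = [
    ((1.0, 1.0), "coffee shop"),
    ((2.0, 3.0), "library"),
    ((5.0, 5.0), "museum"),
    ((8.0, 1.0), "stadium"),
]

def range_query(data, x_min, y_min, x_max, y_max):
    """Return every (key, value) pair whose point key lies inside the rectangle."""
    return [(k, v) for k, v in data
            if x_min <= k[0] <= x_max and y_min <= k[1] <= y_max]

def knn_query(data, query_point, k):
    """Return the k pairs whose point keys are nearest to query_point (Euclidean)."""
    return heapq.nsmallest(k, data, key=lambda kv: math.dist(kv[0], query_point))
```

For example, `range_query(records, 0, 0, 3, 3)` selects the two records in the lower-left corner, and `knn_query(records, (4.0, 4.0), 2)` returns the two records closest to (4, 4).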
Skew is a major issue in spatial data. LocationSpark focuses on two types of skew:
- Unbalanced data partitioning: solved by spatial indexes.
- Query skew: solved by the query scheduler.
How?
1. It dynamically collects statistics (the number of queries) from each partition.
2. A cost model evaluates the overhead of repartitioning the hotspot partitions.
3. It chooses a set of partitions to be reallocated to workers at an affordable cost.
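The three scheduler steps can be sketched as a small greedy routine. The slides do not give the actual cost model, so everything here is an assumption: the function name `pick_partitions_to_split`, the rule that a split halves a partition's query load, and the single shared cost budget.

```python
def pick_partitions_to_split(query_counts, repartition_cost, budget):
    """Greedily choose hotspot partitions worth repartitioning.

    query_counts:     partition id -> number of queries observed (step 1)
    repartition_cost: partition id -> estimated repartitioning overhead (step 2)
    budget:           total overhead we are willing to pay (step 3)
    """
    chosen, spent = [], 0.0
    # Visit the most heavily queried partitions first.
    for pid in sorted(query_counts, key=query_counts.get, reverse=True):
        benefit = query_counts[pid] / 2.0  # assumption: a split halves the load
        cost = repartition_cost[pid]
        # Split only if it pays off and still fits in the remaining budget.
        if benefit > cost and spent + cost <= budget:
            chosen.append(pid)
            spent += cost
    return chosen
```

With counts `{"p0": 1000, "p1": 40, "p2": 600}`, costs `{"p0": 200, "p1": 100, "p2": 250}`, and a budget of 300, only `p0` is selected: `p2` would pay off but no longer fits the budget, and `p1` is not a hotspot.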
Global Index
- It first samples the data to learn the distribution.
- Then, it builds the global index.
- Grid and region quadtree are used for the global index.
Local Index
- The type of the local index can be specified by users.
- It offers a grid local index, an R-tree, a variant of the quadtree, or an IR-tree.
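The "sample first, then build the global index" idea can be illustrated with a small grid whose cell boundaries come from sample quantiles. This is a sketch under assumptions, not the paper's construction: the helper names are hypothetical, and equal-count cuts per axis are just one simple way to adapt a grid to the sampled distribution.

```python
import bisect
import random

def quantile_cuts(values, parts):
    """Return parts-1 cut points that split the values into equal-count groups."""
    ordered = sorted(values)
    return [ordered[len(ordered) * i // parts] for i in range(1, parts)]

def build_global_index(points, sample_size, parts_per_axis, seed=0):
    """Sample the data, then derive grid boundaries from the sample."""
    sample = random.Random(seed).sample(points, min(sample_size, len(points)))
    x_cuts = quantile_cuts([p[0] for p in sample], parts_per_axis)
    y_cuts = quantile_cuts([p[1] for p in sample], parts_per_axis)
    return x_cuts, y_cuts

def partition_of(index, point):
    """Map a point to the id of the grid cell (partition) that owns it."""
    x_cuts, y_cuts = index
    col = bisect.bisect_right(x_cuts, point[0])
    row = bisect.bisect_right(y_cuts, point[1])
    return row * (len(x_cuts) + 1) + col
```

Each partition would then build its own local index (grid, R-tree, quadtree variant, or IR-tree) over the points routed to it; that part is not shown here.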
To support data updates, each version of a spatial index can be persisted to disk for fault tolerance. The spatial indexes are therefore immutable and are implemented using the path-copying approach.
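Path copying is easiest to see on a small tree. The sketch below uses an immutable binary search tree over one-dimensional integer keys as a stand-in for a spatial index; that simplification, and the names `Node`, `insert`, and `contains`, are assumptions for illustration. An insert copies only the nodes on the path from the root to the change and shares every other subtree with the previous version.

```python
from typing import NamedTuple, Optional

class Node(NamedTuple):
    """Immutable tree node; fields cannot be reassigned after creation."""
    key: int
    left: Optional["Node"]
    right: Optional["Node"]

def insert(root, key):
    """Return the root of a new version; the old version is left untouched."""
    if root is None:
        return Node(key, None, None)
    if key < root.key:
        # Copy this node, rebuild the left path, share the right subtree.
        return Node(root.key, insert(root.left, key), root.right)
    if key > root.key:
        # Copy this node, share the left subtree, rebuild the right path.
        return Node(root.key, root.left, insert(root.right, key))
    return root  # key already present: reuse the existing version

def contains(root, key):
    """Standard binary-search lookup on any version of the tree."""
    while root is not None:
        if key == root.key:
            return True
        root = root.left if key < root.key else root.right
    return False
```

Because old versions stay valid after every update, any version can be written to disk or read by concurrent queries without locking.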
A Bloom filter is a probabilistic data structure used to test whether an element is a member of a set. False positives are possible, but false negatives are not. *
The authors embed a spatial Bloom filter in the global spatial index; it can answer whether a spatial point is contained inside a spatial range. How?
* https://en.wikipedia.org/wiki/Bloom_filter
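The slides do not explain how the spatial variant works, so the sketch below shows only a generic Bloom filter. As one possible spatial use, which is an assumption and not the paper's design, points are mapped to grid-cell keys by the hypothetical helper `cell_key`, and those keys are stored in the filter.

```python
import hashlib
import math

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(item))

def cell_key(point, cell_size):
    """Hypothetical helper: name of the grid cell that contains the point."""
    return f"{math.floor(point[0] / cell_size)}:{math.floor(point[1] / cell_size)}"
```

A worker could check `might_contain(cell_key(p, 1.0))` before sending a point to a remote partition and skip the transfer whenever the answer is False.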
LocationSpark dynamically caches frequently accessed data in memory and stores less frequently used data on disk. How?
1. Access frequencies and the corresponding timestamps are recorded in the spatial index.
2. Then, it aggregates the access frequencies.
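The two steps can be sketched as a tracker that records timestamps per partition and aggregates them into a frequency over a recent time window. The class name `AccessTracker`, the sliding window, and the fixed number of in-memory slots are assumptions; the slides do not say how the aggregated frequencies are turned into a caching decision.

```python
class AccessTracker:
    def __init__(self):
        self.accesses = {}  # partition id -> list of access timestamps

    def record(self, partition_id, timestamp):
        """Step 1: log one access together with its timestamp."""
        self.accesses.setdefault(partition_id, []).append(timestamp)

    def frequency(self, partition_id, now, window):
        """Step 2: aggregate accesses that fall inside the recent window."""
        stamps = self.accesses.get(partition_id, [])
        return sum(1 for t in stamps if now - t <= window)

    def plan(self, now, window, memory_slots):
        """Return (partitions to keep in memory, partitions to spill to disk)."""
        ranked = sorted(self.accesses,
                        key=lambda p: self.frequency(p, now, window),
                        reverse=True)
        return ranked[:memory_slots], ranked[memory_slots:]
```

Timestamps are passed in explicitly so the sketch stays deterministic; a real system would read them from a clock.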
Two real spatial datasets are used:
- Twitter dataset: gathered from January 2013 to July 2014; its size is 250 GB.
- OpenStreetMap: each spatial object has coordinates (longitude, latitude) and an object ID. It contains 1.7 billion points and takes 62.3 GB of disk space.
Experiments were run on a cluster of 6 Dell compute nodes, each with two 8-core Intel E5-2650 v2 CPUs, 32 GB of memory, and 48 TB of local storage. The cluster runs Spark 1.5.0 on YARN.
Experiments compare the performance of LocationSpark with GeoSpark and SpatialSpark.