Presented by Dardan Xhymshiti Spring 2016 Authors Sean

Presented by: Dardan Xhymshiti Spring 2016

Authors: Sean Chester* Darius Sidlauskas` Ira Assent* Kenneth S. Bogh* *Data-Intensive Systems Group, Aarhus University, Denmakr `Data-Intensive Applications and Systems Laboratory, EPFL, Switzerland Publication: ICDE 2015 Type: Research Paper 2

Skyline is expensive to compute especially in large datasets. The recent multi-core skyline algorithms does not effectively reduce the dominance tests. State-of-the art skyline algorithms outperform multi-core algorithms. Most of the multi-core Skyline algorithms use the Divide-&-Conquer approach which has two drawbacks: If the number of local skyline points is large, the merging step is expensive. Increasing the cardinality of the dataset, the computation becomes expensive. 3

To come up with a new multi-core algorithm, which eliminate as much as it can dominance tests. 4

Provide an overview about skyline operator. Introduces to the innovative skyline Hybrid algorithm. Provide experiments which shows that this algorithm outperforms multi-core and sequential algorithms. 5

How to increase the performance of skyline algorithms: Implementation in GPUs Implementation in Multi-Core CPUs. Implementing in distributing environments like Map. Reduce. The authors have developed an algorithm called: Hybrid The authors have chosen the Multi-Core CPU to do the implementation of the algorithm because of: Cheaper shared data structures and Parallel work need not be isolated. 6

7

Sort-based algorithms (quickly detect dominance relationships) SFS (Sort Filter Skyline) LESS Sa. LSa (Sort and Limit Skyline algorithm) Object-based space partitioning (quickly detect incomparability) Object-Space partitioning BSky-Tree-P 8

Presented by: Dardan Xhymshiti Spring 2016

Authors: Michael Shekelyam Gregor Josse Matthias Schubert Institute of Informatics, Ludwig-Maximilians-University Munich Publication: ICDE 2015 Type: Research Paper 10

In many application areas, data is organized as a network of graph. Important task: compute a cost-optimal path between a start node and a target one. Example: Road networks (Cost criteria: travel time, travel distance, energy consumption etc. ) Computer networks (Cost criteria: bandwidth and the latency between routers. ) Cost vector: when considering more than one criterion at a time, the cost of complete path is called cost vector. Cost criteria 1 Cost criteria 2 Cost criteria 3 11

How to define if a path is an optimal path? 1. Map the cost vector to a value by employing a monotonic combination function, and then sort the paths. The top n paths are the optimal ones. Problems: a) Hard to find a suitable function, b) Different types of cost might have different levels of scale. 2. Compute the pareto optimal (mathematical definition of Skyline) cost vectors. (This is also known as conventional path skyline). Problems: a) The number of pareto optimal paths might increase exponentially as function of distance and the amount of considered cost criteria. b) Showing to the user a large amount of results is not very helpful. 12

There actually exist solutions for computing linear path skylines, but they are restricted to the specific case of having just two cost criteria. Problem: cannot be generalized to more criteria. The number of skyline paths increases exponentially with the distance between the locations and the number of cost criteria. Thus the result set might be too big. 13

Come up with a new approach of computing the results set of path skylines, which provides better and faster results. 14

Recall: Conventional path skyline computes all potential optimal paths, but the result is too big. Idea: reduce the result set, to only show the paths which are optimal under a weighted sum function or linear combination of cost criteria. Intuitively saying: The user weights each type of cost with a percentage describing its importance. How to compute the linear path skyline? Naive approach: compute the conventional path skyline and then compute the convex hull on the resulting cost vectors. (Inefficient). 15

What is Convex Hull? Definition: In mathematics, the convex hull or convex envelope of a set X of points in the Euclidean plane or Euclidean space is the smallest convex set that contains X. 16

The authors come up with an algorithm called LSCH (Linear Skyline Convex Hull) which constructs a linear path skyline successively while only adding new paths which are members of the result set. Implementation overview: 1. To add a new cost vector, a single search is performed which combine the cost vectors based on the normal vectors of the hyper planes currently limiting the linear path skyline. 2. To identify the areas on the linear skyline where additional results might still exits, the algorithms applies multidimensional convex hull computation. 17

Experiments are run in two different types of networks: 1. Munich road network with five cost criteria. 2. Artificial lattice graphs that allow to simulate different problem instances and parameter settings. 18

Computing route Skylines algorithms. Convex hull algorithm. 19

Presented by: Shahab Helmi Spring 2016

Authors: Publication: ICDE 2015 Type: Research Paper 21

A recent study suggests that the routes provided by a leading navigation service often fail to agree with the routes chosen by local drivers, why? Limited number of travel costs, e. g. , distance or travel time. With the rapid development and continuing use of vehicle tracking technologies, it is possible to learn and update individual drivers’ driving preferences according to their trajectories. 22

It proposes a novel problem on personalized route recommendation based on big trajectory data. It proposes techniques to model and update driving preferences from drivers’ trajectories. The proposed driving preference model can support arbitrary number of travel costs of interest and distributions of cost ratios. Comprehensive experiments were done conducted on a substantial, real trajectory data set to show efficiency and effectiveness. 23

1. Indexing the road network. 2. Modeling drivers’ preferences from their trajectories considering multiple travel costs. 3. Selecting sub-trajectories according to source, destination, departure time and driver’s preferences. 4. constructing a small graph (Zohreh) with appropriate edge weights reflecting how the driver would like to use the edges based on the selected trajectories. 5. Returning the shortest route in the small graph as the personalized route to the driver. 24

Route Planning Using Trajectories: no driver modeling: Most popular route (MPR) Time period-based most popular route (TPMPR) Top-k popular routes (TKPR) Personalized Route Planning: TRIP: closest work, tested over a smaller dataset, can only model travel time 25

GPS records: 52, 211 taxis in Beijing. during 2012 -09 -30 to 2012 -11 -30. One GPS record is collected in every 5 seconds or less. Road network: 6 th street in Beijing. 28, 342 vertices and 38, 577 edges. 60 km square region. 26

Trajectories: 32, 379, 248 trajectories. starts when a passenger got in the taxi and ends when the passenger got off the taxi. Travel Costs: Travel distance. Travel time. Fuel consumption. 27

Presented by: Shahab Helmi Spring 2016

Authors: Publication: ICDE 2015 Type: Research Paper 29

Episode: an episode (serial episode) is usually defined as a totally ordered set of events that occur relatively close to each other. Frequency of an episode: how frequently it occurs in a sequence. Frequency count methods: Window-based Non-overlapped occurrences Non-interleaved occurrences Total Frequency Minimal Occurrences: can capture the most intense correlation between events. 30

31

32

There is no previous work of online FEM. Frequent episode mining on event sequences (main difference is the frequency count method): Alarm sequences in telecom networks Web navigation logs Timestamped fault reports in car manufacturing plants Sales transactions and stock data News Breath first approaches (Apriori-based) Depth first approaches (prefix tree) 33

Online frequent itemset mining (approximate and exact methods): MOMENT Can-Tree SWIM FP-Stream FDPM … They does not apply on episode mining since in frequent itemset mining the time information is not important while episodes are ordered according to their occurrence time. Hence, keeping the tree up-to-date is hard. Using the last occurrence concept it becomes more efficient. 34

Server 1 (for algorithm): Intel Xeon E 5 -2620 2. 00 GHz processor 32 GB memory Windows server 2008 Server 2 (for My. SQL database): Intel Xeon E 5 -2620 2. 00 GHz processor 16 GB memory Linux 35

Datasets 36

Results 1. 2. Online mode: MESELOBS is similar to MESELO but does not use the concept of last episode occurrence. Min. Supprt Offline mode: the performance of MESELO is compare to baseline algorithms. • • X axis shows window size Min. Support is 10 Window size 37