Collective Spatial Keyword Queries: A Distance Owner-Driven Approach

Cheng Long, Raymond Chi-Wing Wong: The Hong Kong University of Science and Technology
Ke Wang: Simon Fraser University
Ada Wai-Chee Fu: The Chinese University of Hong Kong
Prepared by Cheng Long
Presented by Cheng Long
1 July, 2013

Outline
  • Introduction
  • Contribution
  • Problem Definition
  • MaxSum-CoSKQ
  • Dia-CoSKQ
  • Experimental Results
  • Conclusion

Introduction: Spatial-textual data
  • Spatial + textual
  • Some examples
      • Spatial points of interest
          • E.g., restaurants, shopping malls and hotels.
      • Geo-tagged web objects
          • E.g., geo-tagged photos at Flickr and geo-tagged docs.
      • Geo-social networking data
          • E.g., in Foursquare, each user has a check-in history and a profile.
      • …

Introduction: Spatial Keyword Queries
  • Data: a set of spatial-textual objects
  • Input: a query location and several query keywords
  • Query goals: spatially close & textually similar
      • Spatial keyword top-k query / reverse top-k query
          • The score function
      • Spatial keyword kNN query
          • Keyword covering constraint.
      • Spatial keyword range query
          • Keyword covering constraint.
      • …

Introduction: Collective Spatial Keyword Queries
  • Spatial keyword kNN query / range query
      • A single object covers all the keywords.
      • Not always possible!
          • The query keywords are diverse.
          • The no. of query keywords is large.
  • Collective Spatial Keyword Query (CoSKQ)
      • By Cao et al., SIGMOD'11
      • It finds a set of objects that
          • covers the query keywords collectively;
          • has the smallest cost.
  • Cost functions
      • LinearSum: LinearSum-CoSKQ. Adequately solved!
      • MaxSum: MaxSum-CoSKQ. NP-hard! Cao-Exact: scalability issues!
      • …

Introduction: Motivation
  • Cao-Exact
      • Best-first search algorithm based on the IR-tree.
      • Enumeration
      • Not scalable! 8M objects, 6 query keywords: more than 10 days!
[Figure: an IR-tree with nodes N1 to N8 and inverted files IF1 to IF8, with entries such as "t1: N2, N3; t2: N3" and "t1: o2, o3; t2: o3". The query keywords are t1, t2. Each inner node covers both t1, t2. Enumerated candidates include node sets such as N2, N3 and object sets such as o5, o10; o6, o10; o6, o9.]

Contributions

Problem Definition (1)
  • q: the query
      • A location
      • A set of keywords
  • O: the set of spatial objects, each of which has
      • A location
      • A set of keywords
  • Relevant object: an object that contains at least one query keyword
  • Collective Spatial Keyword Query (CoSKQ): find a set S of objects such that
      • S covers the set of query keywords (S is feasible);
      • the cost of S, denoted by cost(S) (defined later), is the smallest.

Problem Definition (2)
  • Notation
      • max(S, q): the largest distance between an object in S and the query location
      • max(S, S): the largest distance between two objects in S
  • MaxSum cost
      • A linear combination of the two max components, max(S, q) and max(S, S):
        costMaxSum(S) = a * max(S, q) + (1 - a) * max(S, S)
      • Following the convention, we set a = 0.5 by default. Dropping the constant factor 0.5 (which does not change the optimal set) gives
        costMaxSum(S) = max(S, q) + max(S, S)
  • Diameter cost
      • max(S, q) vs. max(S, S): use a "max" operation!
        costDia(S) = max{max(S, q), max(S, S)}
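The two cost functions can be sketched in a few lines of Python. This is only an illustration under stated assumptions: objects are reduced to 2-D points, distances are Euclidean, and the function names (`max_query_dist`, `max_pairwise_dist`, `cost_maxsum`, `cost_dia`) are hypothetical, not taken from the slides.

```python
from itertools import combinations
from math import dist

def max_query_dist(S, q):
    """max(S, q): largest distance between an object in S and the query location q."""
    return max(dist(o, q) for o in S)

def max_pairwise_dist(S):
    """max(S, S): largest distance between two objects in S (0 if S has one object)."""
    return max((dist(a, b) for a, b in combinations(S, 2)), default=0.0)

def cost_maxsum(S, q, a=0.5):
    """MaxSum cost: a * max(S, q) + (1 - a) * max(S, S)."""
    return a * max_query_dist(S, q) + (1 - a) * max_pairwise_dist(S)

def cost_dia(S, q):
    """Diameter cost: max{max(S, q), max(S, S)}."""
    return max(max_query_dist(S, q), max_pairwise_dist(S))

q = (0.0, 0.0)
S = [(3.0, 0.0), (0.0, 4.0)]   # max(S, q) = 4, max(S, S) = 5
print(cost_maxsum(S, q))       # 0.5 * 4 + 0.5 * 5 = 4.5
print(cost_dia(S, q))          # max(4, 5) = 5.0
```

With a = 0.5 the sketch keeps the factor 0.5, so its values are half of max(S, q) + max(S, S); the ranking of candidate sets is the same either way.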

Outline
  • Introduction
  • Contribution
  • Problem Definition
  • MaxSum-CoSKQ
      • Finding Optimal Solution: MaxSum-Exact
      • Finding Approximate Solution: MaxSum-Appro
  • Dia-CoSKQ
  • Experimental Results
  • Conclusion

MaxSum-CoSKQ: Finding Optimal Solutions (1)
Cost function: cost(S) = max(S, q) + max(S, S)
  • Some basic concepts
      • Query distance owner: the object o in S that determines max(S, q)
      • Pairwise distance owners: the two objects o1, o2 in S that determine max(S, S)
      • Distance owner group: the triplet (o, o1, o2)
      • (o, o1, o2)-consistency: a set S is (o, o1, o2)-consistent if its distance owner group is (o, o1, o2)
  • Key observation
      • One "distance owner group" usually corresponds to many feasible sets!
      • All sets with the same distance owner group (o, o1, o2) have the same cost, d(o, q) + d(o1, o2).
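As a rough illustration of the key observation, the sketch below computes the distance owner group of a set and shows two different sets that share it. It assumes 2-D points with Euclidean distance; `distance_owner_group` is a hypothetical helper name, and ties are broken arbitrarily by `max`.

```python
from itertools import combinations
from math import dist

def distance_owner_group(S, q):
    """Return (o, o1, o2): the query distance owner and the pairwise distance owners of S."""
    o = max(S, key=lambda p: dist(p, q))       # determines max(S, q)
    if len(S) < 2:
        return o, o, o                          # a single object owns every distance
    o1, o2 = max(combinations(S, 2), key=lambda pair: dist(*pair))  # determines max(S, S)
    return o, o1, o2

q = (0.0, 0.0)
S1 = [(4.0, 0.0), (0.0, 3.0), (1.0, 1.0)]
S2 = [(4.0, 0.0), (0.0, 3.0), (2.0, 1.0)]

# Both sets have the same owners, hence the same cost d(o, q) + d(o1, o2) = 4 + 5.
print(distance_owner_group(S1, q))
print(distance_owner_group(S2, q))
```

Because the third object differs but never owns a distance, S1 and S2 fall into the same group; searching over groups visits them once instead of twice.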

MaxSum-CoSKQ: Finding Optimal Solutions (2)
Collective Spatial Keyword Query (CoSKQ): find a set S of objects such that
  • S is feasible;
  • cost(S) is minimized.
Two search spaces
  • Feasible set space: S1, S2, S3, …, Sn
      • The size is exponential in terms of the number of relevant objects!
      • Searched by Cao-Exact.
  • Distance owner group space: (o, o1, o2)_1, (o, o1, o2)_2, …, (o, o1, o2)_m
      • The size is cubic in terms of the number of relevant objects.
      • The distance owner-driven approach searches this space directly!

A distance owner-driven approach
    Maintain a best-known feasible set S
    For each triplet (o, o1, o2)                                                  (Issue 1)
        If there exists a feasible set S' which is (o, o1, o2)-consistent then    (Issue 2)
            S ← S' if cost(S') < cost(S)
    Return S
  • Issue 1: How to search over the "triplet" space?
      • A straightforward one checks cubic candidates!
      • Pruning! Only a subset of the triplet space is searched.
  • Issue 2: How to check for a triplet (o, o1, o2) whether there exists a feasible set S' which is (o, o1, o2)-consistent?
      • Should be efficient!

Issue 1: How to search over the "triplet" space? Candidates of o
  • Not all relevant objects need to be considered as candidates of the query distance owner o.
  • o cannot be too close to q.
      • Lower bound of d(o, q): d(o, q) ≥ rmin = d(of, q), where of is the farthest keyword NN from q (for each query keyword take its nearest object to q; of is the farthest of these).
  • Objects that are too far away from q can be ignored.
      • Upper bound of d(o, q): d(o, q) ≤ rmax = cost(S), where S is the best-known feasible set.
  • The candidates therefore lie in a "ring" region, R(S).
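The ring region can be sketched as a simple filter. This is a brute-force illustration, not the indexed search used in the talk: it assumes objects are `(location, keyword_set)` pairs, Euclidean distance, and that every query keyword appears in at least one object. The name `ring_candidates` and its parameters are hypothetical.

```python
from math import dist

def ring_candidates(objects, q_loc, q_keywords, best_cost):
    """Relevant objects with r_min <= d(o, q) <= r_max, in ascending distance from q."""
    relevant = [(loc, kws) for loc, kws in objects if kws & q_keywords]
    # r_min = d(of, q): per query keyword, the distance of its nearest object to q;
    # of is the farthest of those nearest neighbors.
    r_min = max(
        min(dist(loc, q_loc) for loc, kws in relevant if t in kws)
        for t in q_keywords
    )
    r_max = best_cost  # cost(S) of the best-known feasible set
    ring = [obj for obj in relevant if r_min <= dist(obj[0], q_loc) <= r_max]
    return sorted(ring, key=lambda obj: dist(obj[0], q_loc))

objects = [
    ((1.0, 0.0), {"t1"}),
    ((2.0, 0.0), {"t2"}),
    ((3.0, 0.0), {"t1", "t2"}),
    ((9.0, 0.0), {"t1"}),
    ((0.5, 0.0), {"other"}),
]
ring = ring_candidates(objects, (0.0, 0.0), {"t1", "t2"}, best_cost=5.0)
print([loc for loc, _ in ring])  # [(2.0, 0.0), (3.0, 0.0)]
```

Here the nearest t1 object is at distance 1 and the nearest t2 object at distance 2, so r_min = 2; the object at distance 9 exceeds r_max = 5 and is ignored.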

Issue 1: How to search over the "triplet" space? Candidates of o1 and o2
  • Once the candidate query distance owner, say o, is fixed, the pairwise distance owners o1 and o2 are constrained.
      • Restricted in Disk(q, d(o, q))!
  • d(o1, o2) cannot be too small!
      • Lower bound of d(o1, o2): d(o1, o2) ≥ dmin = d(o, q) - min{d(o1, q), d(o2, q)}
      • Reason: d(o1, o2) ≥ d(o, o1) because o1, o2 are the pairwise distance owners, and d(o, o1) ≥ d(o, q) - d(o1, q) by the triangle inequality (likewise for o2).
  • Those with large d(o1, o2) can be pruned!
      • Upper bound of d(o1, o2): d(o1, o2) ≤ dmax = cost(S) - d(o, q), where S is the best-known solution.

Issue 1: How to search over the "triplet" space? Summary
  • Candidates of o:
      • Ring region R(S)
      • Ascending order of the distances from q.
      • The ring shrinks progressively!
  • For each candidate of o, the candidates of o1 and o2:
      • Disk(q, d(o, q))
      • Lower bound of d(o1, o2): d(o1, o2) ≥ dmin = d(o, q) - min{d(o1, q), d(o2, q)}
      • Upper bound of d(o1, o2): d(o1, o2) ≤ dmax = cost(S) - d(o, q)
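The pruning of pairwise distance owner candidates might look as follows. It is a minimal sketch assuming 2-D points and Euclidean distance; `candidate_owner_pairs` is a hypothetical name, and the case o1 = o2 (a single-object set) is left out for brevity.

```python
from itertools import combinations
from math import dist

def candidate_owner_pairs(o, objects, q, best_cost):
    """Yield pairs (o1, o2) inside Disk(q, d(o, q)) that satisfy both bounds on d(o1, o2)."""
    r = dist(o, q)                    # d(o, q) for the fixed query distance owner
    d_max = best_cost - r             # larger pairs cannot beat the best-known set
    inside = [p for p in objects if dist(p, q) <= r]
    for o1, o2 in combinations(inside, 2):
        d12 = dist(o1, o2)
        d_min = r - min(dist(o1, q), dist(o2, q))   # triangle-inequality lower bound
        if d_min <= d12 <= d_max:
            yield o1, o2

q = (0.0, 0.0)
o = (4.0, 0.0)
objects = [(4.0, 0.0), (1.0, 0.0), (1.5, 0.0), (0.0, 3.0)]

# (1, 0) and (1.5, 0) are only 0.5 apart, far below d_min = 4 - 1 = 3, so they are pruned.
print(list(candidate_owner_pairs(o, objects, q, best_cost=10.0)))
# With a better best-known cost of 8, d_max drops to 4 and the pair at distance 5 is pruned too.
print(list(candidate_owner_pairs(o, objects, q, best_cost=8.0)))
```

As the best-known cost improves, d_max shrinks, which mirrors the "ring shrinks progressively" remark for the candidates of o.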

Issue 2: How to check for a triplet (o, o1, o2) whether there exists a feasible set S' which is (o, o1, o2)-consistent?
  • Restrictions on S' (if it exists)
      • d(o1, o2) ≥ d(o, o1)
      • d(o1, o2) ≥ d(o, o2)
      • S' is inside Disk(q, d(o, q))
      • S' is inside Disk(o1, d(o1, o2))
      • S' is inside Disk(o2, d(o1, o2))
      • S' covers the query keywords.
  • Exhaustive search for S' in the intersection of the three disks D(q, d(o, q)), D(o1, d(o1, o2)) and D(o2, d(o1, o2)) with the above restrictions!
      • An inverted file could be utilized here.
      • If it succeeds, return S'; otherwise, we know that S' does not exist!
  • With the two issues fixed, MaxSum-Exact is complete!
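The consistency check can be sketched as a brute-force search over the intersection of the three disks. This is an illustration only: it assumes objects are `(location, keyword_set)` pairs with Euclidean distance, scans a plain list instead of an inverted file, and uses the hypothetical name `find_consistent_set`.

```python
from itertools import combinations
from math import dist

def find_consistent_set(o, o1, o2, objects, q_loc, q_keywords):
    """Return a feasible set whose distance owner group is (o, o1, o2), or None."""
    r = dist(o[0], q_loc)        # would-be max(S', q)
    d12 = dist(o1[0], o2[0])     # would-be max(S', S')
    triple = (o, o1, o2)
    owners = [x for i, x in enumerate(triple) if x not in triple[:i]]
    # The owners themselves must respect both maxima.
    if max(dist(o[0], o1[0]), dist(o[0], o2[0])) > d12:
        return None
    if max(dist(o1[0], q_loc), dist(o2[0], q_loc)) > r:
        return None
    missing = q_keywords - set().union(*(kws for _, kws in owners))
    if not missing:
        return owners
    # Useful objects inside Disk(q, r), Disk(o1, d12) and Disk(o2, d12).
    cands = [x for x in objects
             if x not in owners and x[1] & missing
             and dist(x[0], q_loc) <= r
             and dist(x[0], o1[0]) <= d12 and dist(x[0], o2[0]) <= d12]
    # Exhaustive search: at most one extra object per missing keyword is needed.
    for k in range(1, len(missing) + 1):
        for extra in combinations(cands, k):
            covers = missing <= set().union(*(kws for _, kws in extra))
            close = all(dist(a[0], b[0]) <= d12
                        for a, b in combinations(owners + list(extra), 2))
            if covers and close:
                return owners + list(extra)
    return None

q_loc, q_keywords = (0.0, 0.0), {"t1", "t2", "t3"}
o = ((4.0, 0.0), {"t1"})
o2 = ((0.0, 3.0), {"t2"})
near = ((1.0, 1.0), {"t3"})
far = ((10.0, 10.0), {"t3"})
print(find_consistent_set(o, o, o2, [o, o2, near, far], q_loc, q_keywords))  # uses `near`
print(find_consistent_set(o, o, o2, [o, o2, far], q_loc, q_keywords))        # None
```

The pairwise check over all members is what makes the search exhaustive: lying in the three disks is necessary, but two extra objects could still be farther apart than d(o1, o2).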

MaxSum-CoSKQ: Finding Approximate Solution (1)
  • o-neighborhood feasible set
      • The set containing the Disk(o, d(o, q))-constrained keyword t-NN for each query keyword t.
      • Constrained NN: the region is Disk(o, d(o, q)) and the keyword is t.
  • E.g., the o3-neighborhood feasible set
      • For t1: the Disk(o3, d(o3, q))-constrained keyword t1-NN is o2.
      • For t2: the Disk(o3, d(o3, q))-constrained keyword t2-NN is o5.
      • For t3: the Disk(o3, d(o3, q))-constrained keyword t3-NN is o3.
      • The o3-neighborhood feasible set is {o2, o3, o5}.

MaxSum-CoSKQ: Finding Approximate Solution (1)
  • The costly part of the distance owner-driven approach is the loop over triplets (o, o1, o2) together with the consistency check.
  • MaxSum-Appro replaces the triplet loop by a loop over the relevant objects o in R(S), and the consistency check by the o-neighborhood feasible set.
Algorithm: MaxSum-Appro
    Maintain a best-known feasible set S
    For each relevant object o in R(S)
        S' ← the o-neighborhood feasible set
        S ← S' if cost(S') < cost(S)
    Return S
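A brute-force sketch of the MaxSum-Appro loop is given below. Several points are assumptions rather than details from the slides: objects are `(location, keyword_set)` pairs with Euclidean distance; the constrained region is read as the disk centered at q with radius d(o, q), with nearest neighbors taken with respect to o (the slide notation leaves this open); the ring's lower bound is enforced implicitly by skipping objects whose disk cannot cover all keywords; and nearest neighbors are found by linear scan instead of an index. All function names are hypothetical.

```python
from itertools import combinations
from math import dist, inf

def cost(S, q_loc):
    """cost(S) = max(S, q) + max(S, S)."""
    far_q = max(dist(x[0], q_loc) for x in S)
    far_pair = max((dist(a[0], b[0]) for a, b in combinations(S, 2)), default=0.0)
    return far_q + far_pair

def neighborhood_feasible_set(o, objects, q_loc, q_keywords):
    """o plus, for each keyword o lacks, o's nearest object with that keyword in the disk."""
    r = dist(o[0], q_loc)
    disk = [x for x in objects if dist(x[0], q_loc) <= r]   # assumed region: Disk(q, d(o, q))
    S = [o]
    for t in q_keywords - o[1]:
        with_t = [x for x in disk if t in x[1]]
        if not with_t:
            return None                                     # o is too close to q
        nn = min(with_t, key=lambda x: dist(x[0], o[0]))
        if nn not in S:
            S.append(nn)
    return S

def maxsum_appro(objects, q_loc, q_keywords):
    relevant = sorted((x for x in objects if x[1] & q_keywords),
                      key=lambda x: dist(x[0], q_loc))
    best, best_cost = None, inf
    for o in relevant:                                      # ascending distance from q
        if dist(o[0], q_loc) > best_cost:
            break                                           # beyond the ring's outer radius
        S = neighborhood_feasible_set(o, objects, q_loc, q_keywords)
        if S is not None and cost(S, q_loc) < best_cost:
            best, best_cost = S, cost(S, q_loc)
    return best, best_cost

objects = [
    ((1.0, 0.0), {"t1"}),
    ((-1.0, 0.0), {"t2"}),
    ((1.0, 1.0), {"t2"}),
    ((5.0, 5.0), {"t1", "t2"}),
]
best, best_cost = maxsum_appro(objects, (0.0, 0.0), {"t1", "t2"})
print(sorted(loc for loc, _ in best), best_cost)
```

In this toy instance the first candidates yield the set at (1, 0) and (-1, 0) with cost 3; the candidate (1, 1) then pairs with (1, 0) for cost 1 + √2, and the object at (5, 5) lies outside the shrunken ring and is never examined.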

MaxSum-CoSKQ: Finding Approximate Solution (2)
  • Approximation bound
      • MaxSum-Appro is a 1.375-factor approximation.
      • Refer to our paper for the proof if you are interested.
  • Time complexity
      • O(nr * |q| * log |O|)
      • It has the same worst-case time complexity as Cao-Appro2, but a smaller approximation factor (1.375 vs. 2).

Dia-CoSKQ (1): Finding Exact Solutions
  • costDia(S) = max{max(S, q), max(S, S)}
      • max(S, q): determined by the query distance owner
      • max(S, S): determined by the pairwise distance owners
      • Dominated by the "distance owner group" of S
  • We can apply the distance owner-driven approach to the Dia-CoSKQ problem!
      • with several updates.
  • Updated bounds for the pairwise distance owners o1, o2:
      • Lower bound of d(o1, o2): dmin = d(o, q) - min{d(o1, q), d(o2, q)} is replaced by d(o, q)
      • Upper bound of d(o1, o2): dmax = cost(S) - d(o, q) is replaced by cost(S)

Dia-CoSKQ (2): Finding Approximate Solution

Dia-CoSKQ (3): Adaptations of Existing Solutions

Experimental Results: Set-Up
  • Datasets: GN, Web and Hotel (the same datasets as used by Cao et al.)

                           GN            Web           Hotel
    No. of objects         1,868,821     579,727       20,790
    No. of unique words    222,409       2,899,175     602
    No. of words           18,374,228    249,132,88    80,845

  • Query generation
      • Location and query keywords
  • Algorithms
      • MaxSum-CoSKQ: Cao-Exact, Cao-Appro1, Cao-Appro2, MaxSum-Exact, MaxSum-Appro
      • Dia-CoSKQ: Cao-Exact, Cao-Appro1, Cao-Appro2, Dia-Exact, Dia-Appro
  • Factors & measures
      • No. of query keywords and average no. of keywords contained by an object

Experimental Results: Performance Study (1)
  • Problem: MaxSum-CoSKQ
  • Dataset: Web
  • Factor: |q.ψ|
  • MaxSum-Exact runs faster than Cao-Exact by up to 3 orders of magnitude.
  • Our MaxSum-Appro runs fast and is comparable with Cao-Appro2.
  • Our MaxSum-Appro returns near-to-optimal solutions.

Experimental Results: Performance Study (2)
  • Problem: MaxSum-CoSKQ
  • Dataset: Web
  • Factor: |o.ψ|
  • Cao-Exact is not scalable wrt |o.ψ|.
  • Our MaxSum-Exact is scalable wrt |o.ψ|.

Experimental Results: Performance Study (3)
  • Problem: MaxSum-CoSKQ and Dia-CoSKQ
  • Dataset: GN
  • Scalability test.
  • Cao-Exact runs for more than 10 days when the data size is about 8 million!
  • MaxSum-Exact is still fast (≤ 100 s) when the data size is in the millions.
  • MaxSum-Appro runs in real time (≤ 1 s).

Conclusion
  • Collective Spatial Keyword Query problem
  • A distance owner-driven approach.
  • Exact and approximate algorithms.
  • Extensive experiments.

My research interest
  • Database queries and/or data mining on
      • Spatial data
          • E.g., spatial matching [SIGMOD'13]
      • Spatial-textual data
          • E.g., spatial keyword query [SIGMOD'13]
      • Trajectory data
          • E.g., trajectory compression [VLDB'13]
      • Social network data
          • E.g., viral marketing [ICDM'11]
      • Graph
          • E.g., shortest path queries
      • etc.

Q & A