Probabilistic Data Management Chapter 8 Probabilistic Query Answering

Probabilistic Data Management Chapter 8: Probabilistic Query Answering (6)

Objectives
- In this chapter, you will:
  - Explore the definitions of more probabilistic query types
    - Probabilistic top-k query

Recall: Probabilistic Query Types
- Uncertain/probabilistic database
  - Probabilistic spatial queries
    - Probabilistic range query
    - Probabilistic k-nearest neighbor query
    - Probabilistic group nearest neighbor (PGNN) query
    - Probabilistic reverse k-nearest neighbor query
    - Probabilistic spatial join / similarity join
  - Preference queries
    - Probabilistic top-k query (or ranked query)
    - Probabilistic skyline query
    - Probabilistic reverse skyline query

Motivation Example
- In a coal mine surveillance application, a number of sensors are deployed to detect gas density, temperature, and so on
- Assume we have a preference function f(O) = O.temp + O.den
- Top-k query: retrieve the k sensors with the highest scores (the most dangerous ones)
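As a point of comparison for the uncertain case, a top-k query over exact sensor readings can be sketched as below. The sensor ids and readings are hypothetical values chosen for illustration; only the preference function f(O) = O.temp + O.den comes from the slide.

```python
import heapq

# Hypothetical sensor readings (illustrative values, not taken from the slides).
SENSORS = {
    "s1": {"temp": 41.5, "den": 3.0},
    "s2": {"temp": 35.0, "den": 1.2},
    "s3": {"temp": 52.0, "den": 4.1},
    "s4": {"temp": 38.0, "den": 2.5},
}


def preference(reading):
    """Preference function from the slide: f(O) = O.temp + O.den."""
    return reading["temp"] + reading["den"]


def top_k(readings, k):
    """Return the ids of the k sensors with the highest preference scores."""
    return heapq.nlargest(k, readings, key=lambda sid: preference(readings[sid]))
```

With these made-up readings, `top_k(SENSORS, 2)` returns `["s3", "s1"]`. Once the readings become uncertain, a single sort like this is no longer well defined, which is what motivates the semantics that follow.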

Motivation Example (cont'd)
- Sensor data usually contain noise
- The reported data can be modeled as uncertain objects
- Goal: obtain top-k query answers over uncertain data with high confidence

Background of Probabilistic Top-k Query
- Under possible worlds semantics
  - Each tuple t is associated with a score t.score
  - Each tuple t is associated with an existence probability t.prob
[Figure: possible worlds and the query answer in each possible world]

Different Semantics of Probabilistic Top-k Query
- Top-k query in probabilistic databases
  - Consider each possible world from which top-k answers are retrieved
  - Aggregate the top-k answers (weighted by the probabilities of possible worlds)
- Aggregation semantics
  - Uncertain Top-k (U-Topk) [Soliman et al., ICDE 2007]
  - Uncertain Rank-k (U-kRanks) [Soliman et al., ICDE 2007]
  - Probabilistic Threshold Top-k (PT(h)) [Hua et al., SIGMOD 2008]
  - Expected Ranks (Exp-Rank) [Cormode et al., ICDE 2009]
  - Expected Score (E-Score) [Cormode et al., ICDE 2009]

Uncertain Top-k (U-Topk) [Soliman et al., ICDE 2007]
- Find the one top-k answer vector that appears in possible worlds with the highest probability
[Figure: probabilistic database -> possible worlds -> top-k answer vector of each world -> group by top-k answer vectors -> U-Topk answers]

Example of U-Topk
- Given the uncertain database and k = 2

Tuple  Score  P(t)
t1     100    0.4
t2     85     0.5
t3     70     1
t4     60     0.5

Rules: R1 = {t1}, R2 = {t2, t4}, R3 = {t3}

Possible World (W)   Pr(W)
{t1, t2, t3}         P(t1)P(t2)P(t3) = 0.2
{t1, t3, t4}         P(t1)P(t3)P(t4) = 0.2
{t2, t3}             (1 - P(t1))P(t2)P(t3) = 0.3
{t3, t4}             (1 - P(t1))P(t3)P(t4) = 0.3

- Probability of each top-2 answer vector:
  - Pr[{t1, t2}] = 0.2
  - Pr[{t1, t3}] = 0.2
  - Pr[{t2, t3}] = 0.3
  - Pr[{t3, t4}] = 0.3
- Final Result: {t2, t3} or {t3, t4} (tied at 0.3)
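The example can be reproduced by brute-force enumeration of possible worlds. This is a minimal sketch, assuming that tuples in the same rule are mutually exclusive and that different rules are independent; the names `TUPLES`, `RULES`, `possible_worlds`, and `u_topk` are illustrative, and exhaustive enumeration is only practical for tiny databases.

```python
from collections import defaultdict
from itertools import product

# Example database from the slide: tuple -> (score, existence probability).
TUPLES = {"t1": (100, 0.4), "t2": (85, 0.5), "t3": (70, 1.0), "t4": (60, 0.5)}
# Generation rules: tuples in the same rule are mutually exclusive.
RULES = [["t1"], ["t2", "t4"], ["t3"]]


def possible_worlds(tuples, rules):
    """Yield (world, probability); each rule contributes at most one tuple."""
    per_rule = []
    for rule in rules:
        options = [(t, tuples[t][1]) for t in rule]
        none_prob = 1.0 - sum(p for _, p in options)
        if none_prob > 1e-12:  # the rule may also contribute no tuple at all
            options.append((None, none_prob))
        per_rule.append(options)
    for choice in product(*per_rule):
        prob = 1.0
        world = []
        for t, p in choice:
            prob *= p
            if t is not None:
                world.append(t)
        if prob > 0:
            yield frozenset(world), prob


def u_topk(tuples, rules, k):
    """Map each top-k answer vector to the total probability of its worlds."""
    vector_prob = defaultdict(float)
    for world, prob in possible_worlds(tuples, rules):
        ranked = sorted(world, key=lambda t: tuples[t][0], reverse=True)
        vector_prob[tuple(ranked[:k])] += prob
    return dict(vector_prob)
```

Calling `u_topk(TUPLES, RULES, 2)` yields four answer vectors; `("t2", "t3")` and `("t3", "t4")` both carry probability about 0.3, matching the tie on the slide.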

Uncertain Rank-k (U-kRanks) [Soliman et al., ICDE 2007]
- For each j ∈ [1, k], find the one tuple that has the j-th rank in possible worlds with the highest probability
[Figure: probabilistic database -> possible worlds -> tuple with the j-th rank in each world -> for a given j ∈ [1, k], group by tuples with the j-th rank -> U-kRanks answers]

Example of U-kRanks
- Given the uncertain database and k = 2

Tuple  Score  P(t)
t1     100    0.4
t2     85     0.5
t3     70     1
t4     60     0.5

Rules: R1 = {t1}, R2 = {t2, t4}, R3 = {t3}

Possible World (W)   Pr(W)
{t1, t2, t3}         P(t1)P(t2)P(t3) = 0.2
{t1, t3, t4}         P(t1)P(t3)P(t4) = 0.2
{t2, t3}             (1 - P(t1))P(t2)P(t3) = 0.3
{t3, t4}             (1 - P(t1))P(t3)P(t4) = 0.3

- At rank i = 1: Pr[t1] = 0.4, Pr[t2] = 0.3, Pr[t3] = 0.3
- At rank i = 2: Pr[t2] = 0.2, Pr[t3] = 0.5, Pr[t4] = 0.3
- Final Result: {t1, t3}
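The per-rank probabilities can be checked with a short sketch that works directly on the four possible worlds listed in the example. The names `SCORES`, `WORLDS`, `rank_distribution`, and `u_kranks` are illustrative, not taken from the cited paper.

```python
from collections import defaultdict

# Scores and possible worlds copied from the example.
SCORES = {"t1": 100, "t2": 85, "t3": 70, "t4": 60}
WORLDS = [
    ({"t1", "t2", "t3"}, 0.2),
    ({"t1", "t3", "t4"}, 0.2),
    ({"t2", "t3"}, 0.3),
    ({"t3", "t4"}, 0.3),
]


def rank_distribution(worlds, scores):
    """dist[j][t] = probability that tuple t is at rank j (ranks start at 1)."""
    dist = defaultdict(lambda: defaultdict(float))
    for world, prob in worlds:
        ranked = sorted(world, key=scores.get, reverse=True)
        for j, t in enumerate(ranked, start=1):
            dist[j][t] += prob
    return dist


def u_kranks(worlds, scores, k):
    """For each rank j in 1..k, pick the tuple most likely to occupy rank j."""
    dist = rank_distribution(worlds, scores)
    return [max(dist[j], key=dist[j].get) for j in range(1, k + 1)]
```

Here `u_kranks(WORLDS, SCORES, 2)` returns `["t1", "t3"]`. Note that the same tuple may win more than one rank under this semantics, since each rank is decided independently.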

Probabilistic Threshold Top-k (PT(h)) [Hua et al., SIGMOD 2008]
- Find the k tuples that are in the top-h answer sets of possible worlds with the highest probabilities
[Figure: probabilistic database -> possible worlds -> top-h answer set of each world -> group by tuples in top-h answer sets -> PT(h) answers]

Example of PT-k
- Given the uncertain database, k = 2, threshold = 0.5

Tuple  Score  P(t)
t1     100    0.4
t2     85     0.5
t3     70     1
t4     60     0.5

Rules: R1 = {t1}, R2 = {t2, t4}, R3 = {t3}

Possible World (W)   Pr(W)
{t1, t2, t3}         P(t1)P(t2)P(t3) = 0.2
{t1, t3, t4}         P(t1)P(t3)P(t4) = 0.2
{t2, t3}             (1 - P(t1))P(t2)P(t3) = 0.3
{t3, t4}             (1 - P(t1))P(t3)P(t4) = 0.3

- Probability of being in the top-2: Pr[t1] = 0.4, Pr[t2] = 0.5, Pr[t3] = 0.8, Pr[t4] = 0.3
- With threshold = 0.5, the qualifying tuples are Pr[t2] = 0.5 and Pr[t3] = 0.8
- Final Result: {t2, t3}
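A small sketch of the thresholding step, again working on the listed possible worlds. It assumes a tuple qualifies when its top-k probability is at least the threshold (which is what the example's result implies for t2); the function names and the floating-point tolerance are illustrative choices.

```python
# Scores and possible worlds copied from the example.
SCORES = {"t1": 100, "t2": 85, "t3": 70, "t4": 60}
WORLDS = [
    ({"t1", "t2", "t3"}, 0.2),
    ({"t1", "t3", "t4"}, 0.2),
    ({"t2", "t3"}, 0.3),
    ({"t3", "t4"}, 0.3),
]


def topk_probabilities(worlds, scores, k):
    """Probability that each tuple is among the top-k of a possible world."""
    prob = {t: 0.0 for t in scores}
    for world, p in worlds:
        for t in sorted(world, key=scores.get, reverse=True)[:k]:
            prob[t] += p
    return prob


def pt_k(worlds, scores, k, threshold):
    """Tuples whose top-k probability reaches the threshold."""
    prob = topk_probabilities(worlds, scores, k)
    # Small tolerance (an assumption) so that sums like 0.2 + 0.3 still meet 0.5.
    return sorted(t for t, p in prob.items() if p >= threshold - 1e-12)
```

`pt_k(WORLDS, SCORES, 2, 0.5)` returns `["t2", "t3"]`, in line with the final result on the slide.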

Expected Ranks (Exp-Rank) [Cormode et al., ICDE 2009]
- Expected rank of t1: $\sum_{pw} r_{pw}(t_1) \cdot \Pr(pw)$
- Find the k tuples with the best expected ranks (that is, the smallest expected rank values)
[Figure: probabilistic database of tuples t1, t2, ... with alternatives -> possible worlds]

Expected Score (E-Score) [Cormode et al., ICDE 2009]
- Expected score of t1: $\sum_{pw \ni t_1} \mathrm{score}(t_1) \cdot \Pr(pw)$, summing over the possible worlds that contain t1
- Find the k tuples with the highest expected scores
[Figure: probabilistic database of tuples t1, t2, ... with alternatives -> possible worlds]
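When each tuple has a single fixed score and an existence probability, the sum over possible worlds collapses to score(t) × Pr(t). The sketch below applies this to the four-tuple database used in the other examples of this chapter; `TUPLES`, `expected_scores`, and `e_score_topk` are illustrative names, and the simplification is an assumption that holds only for this tuple-level model.

```python
# Example database from the chapter's running example:
# tuple -> (score, existence probability).
TUPLES = {"t1": (100, 0.4), "t2": (85, 0.5), "t3": (70, 1.0), "t4": (60, 0.5)}


def expected_scores(tuples):
    """E-Score(t) = score(t) * Pr(t) when the score itself is certain."""
    return {t: score * prob for t, (score, prob) in tuples.items()}


def e_score_topk(tuples, k):
    """The k tuples with the highest expected scores."""
    es = expected_scores(tuples)
    return sorted(es, key=es.get, reverse=True)[:k]
```

For k = 2 this returns `["t3", "t2"]` (expected scores of about 70 and 42.5). The high-score but unlikely tuple t1 drops out, which illustrates how E-Score trades score against probability.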

Example of Expected Ranks
- Given the uncertain database and k = 2

Tuple  Score  P(t)
t1     100    0.4
t2     85     0.5
t3     70     1
t4     60     0.5

Rules: R1 = {t1}, R2 = {t2, t4}, R3 = {t3}

Possible World (W)   Pr(W)
{t1, t2, t3}         P(t1)P(t2)P(t3) = 0.2
{t1, t3, t4}         P(t1)P(t3)P(t4) = 0.2
{t2, t3}             (1 - P(t1))P(t2)P(t3) = 0.3
{t3, t4}             (1 - P(t1))P(t3)P(t4) = 0.3

- If a tuple doesn't appear in a world, its rank is considered to be the last one (one position after the tuples that do appear)
- E[R(t1)] = 1×0.2 + 1×0.2 + 3×0.3 + 3×0.3 = 2.2
- E[R(t2)] = 2.4
- E[R(t3)] = 1.9
- E[R(t4)] = 2.9
- Final Result: {t3, t1} (the two smallest expected ranks)
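The expected ranks above can be recomputed with the following sketch. It assumes 1-based ranks and assigns an absent tuple the rank |W| + 1 in world W, a convention inferred from the numbers in the example rather than quoted from the cited paper; the function names are illustrative.

```python
# Scores and possible worlds copied from the example.
SCORES = {"t1": 100, "t2": 85, "t3": 70, "t4": 60}
WORLDS = [
    ({"t1", "t2", "t3"}, 0.2),
    ({"t1", "t3", "t4"}, 0.2),
    ({"t2", "t3"}, 0.3),
    ({"t3", "t4"}, 0.3),
]


def expected_ranks(worlds, scores):
    """Expected 1-based rank of every tuple over the possible worlds."""
    exp = {t: 0.0 for t in scores}
    for world, p in worlds:
        ranked = sorted(world, key=scores.get, reverse=True)
        position = {t: j for j, t in enumerate(ranked, start=1)}
        for t in scores:
            # Absent tuples are placed right after the last present tuple.
            exp[t] += p * position.get(t, len(world) + 1)
    return exp


def exp_rank_topk(worlds, scores, k):
    """The k tuples with the smallest expected ranks."""
    exp = expected_ranks(worlds, scores)
    return sorted(exp, key=exp.get)[:k]
```

`exp_rank_topk(WORLDS, SCORES, 2)` returns `["t3", "t1"]`, and the intermediate values agree with 2.2, 2.4, 1.9, and 2.9 up to floating-point rounding.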

Unified Ranking Functions
- Parameterized Ranking Function (PRF): a weighted function $\Upsilon_w(t)$
- A probabilistic top-k query returns the k tuples with the highest $|\Upsilon_w(t)|$ values

Li, J., Deshpande, A. A Unified Approach to Ranking in Probabilistic Databases. In VLDB, 2009.
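A sketch of how the weighted function is commonly written, assuming the notation of the cited paper: $w(t, i)$ is the weight of tuple $t$ at rank $i$, $r_{pw}(t)$ is the rank of $t$ in possible world $pw$, and $r(t)$ is the random variable for the rank of $t$. The symbol names are assumptions about the slide's notation.

```latex
\Upsilon_w(t)
  \;=\; \sum_{pw \,:\, t \in pw} w\bigl(t, r_{pw}(t)\bigr)\,\Pr(pw)
  \;=\; \sum_{i > 0} w(t, i)\,\Pr\bigl(r(t) = i\bigr)
```

The second form groups possible worlds by the rank of $t$, so computing the rank distribution $\Pr(r(t) = i)$ once is enough to evaluate the function for any choice of weights.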

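Unified Ranking Functions (cont'd)
- When w(t, i) = 1, the result is the set of k tuples with the highest existence probabilities
- When w(t, i) = score(t), E-Score
- When w(t, i) = 1 for i ≤ h and 0 otherwise, PT(h)
- When w(t, i) = 1 for i = j and 0 otherwise (one such function for each j ∈ [1, k]), U-Rank
- PRF cannot simulate U-Topk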
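To see how different weights recover different semantics, the sketch below evaluates a PRF value as the sum over ranks i of w(t, i) · Pr(r(t) = i) on the four-tuple running example. That formula, the step weight used for PT(h), and the indicator weight used for a single rank are assumptions based on the cited paper's definitions; all names are illustrative.

```python
# Scores and possible worlds from the chapter's running example.
SCORES = {"t1": 100, "t2": 85, "t3": 70, "t4": 60}
WORLDS = [
    ({"t1", "t2", "t3"}, 0.2),
    ({"t1", "t3", "t4"}, 0.2),
    ({"t2", "t3"}, 0.3),
    ({"t3", "t4"}, 0.3),
]


def rank_probabilities(worlds, scores):
    """probs[t][i] = Pr(r(t) = i) for 1-based rank i."""
    probs = {t: {} for t in scores}
    for world, p in worlds:
        ranked = sorted(world, key=scores.get, reverse=True)
        for i, t in enumerate(ranked, start=1):
            probs[t][i] = probs[t].get(i, 0.0) + p
    return probs


def prf(worlds, scores, weight):
    """PRF value of every tuple: sum over ranks i of weight(t, i) * Pr(r(t) = i)."""
    probs = rank_probabilities(worlds, scores)
    return {t: sum(weight(t, i) * p for i, p in probs[t].items()) for t in scores}


def prf_topk(worlds, scores, weight, k):
    """The k tuples with the largest absolute PRF values."""
    values = prf(worlds, scores, weight)
    return sorted(values, key=lambda t: abs(values[t]), reverse=True)[:k]


existence = prf(WORLDS, SCORES, lambda t, i: 1.0)                  # Pr(t)
e_score = prf(WORLDS, SCORES, lambda t, i: SCORES[t])              # E-Score
pt_2 = prf(WORLDS, SCORES, lambda t, i: 1.0 if i <= 2 else 0.0)    # PT(2)
at_rank_2 = prf(WORLDS, SCORES, lambda t, i: 1.0 if i == 2 else 0.0)
```

On this data, `pt_2` reproduces the top-2 probabilities of the PT-k example (about 0.4, 0.5, 0.8, 0.3), and the largest entry of `at_rank_2` belongs to t3, the rank-2 winner in the U-kRanks example.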

Unified Ranking Functions (cont'd)
- Two new semantics: PRFw(h) and PRFe(h)
  - PRFw(h): w(t, i) = $w_i$ for i ≤ h, and w(t, i) = 0 for i > h
  - PRFe(h): w(t, i) = $\alpha^i$, where $\alpha$ can be a real or complex number

Ranking Algorithms
- Assuming tuple independence
  - Compute the probability that a tuple $t_i$ has the j-th rank
  - Observation: the coefficient $c_j$ of $x^j$ in a generating function $F_i(x)$ is exactly the probability that $t_i$ is at rank j

Example
- Incremental computation of $F_i(x)$: consider the rank of tuple t3
[Figure: worked example of the incremental computation, including the term 0.4x]
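The incremental computation can be sketched in a few lines. This assumes the generating function has the form F_i(x) = (product over higher-scored tuples t_l of (1 - p_l + p_l x)) · p_i x, which is the usual construction for independent tuples but is not spelled out in the text here. The probabilities 0.4, 0.5, 1.0, 0.5 are borrowed from the running example and treated as independent purely for illustration.

```python
def poly_mul(a, b):
    """Multiply two polynomials given as coefficient lists (index = power of x)."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def rank_distributions(probs):
    """probs: existence probabilities of independent tuples, sorted by
    decreasing score. Returns one coefficient list per tuple, where entry j
    is the probability that the tuple is at rank j (entry 0 is always 0)."""
    prefix = [1.0]  # product of (1 - p + p*x) over the higher-scored tuples
    result = []
    for p in probs:
        result.append(poly_mul(prefix, [0.0, p]))   # F_i(x) = prefix * (p * x)
        prefix = poly_mul(prefix, [1.0 - p, p])     # fold this tuple into the prefix
    return result


F = rank_distributions([0.4, 0.5, 1.0, 0.5])
```

For the third tuple, `F[2]` is approximately `[0, 0.3, 0.5, 0.2]`: it is at rank 1 with probability 0.3, rank 2 with 0.5, and rank 3 with 0.2. Reusing the prefix product is what makes the computation incremental, since each new tuple costs one polynomial multiplication instead of a fresh product.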

Ranking Algorithms (cont'd)
- Assuming a correlated database represented by an and/xor tree
  - Generating functions on the and/xor tree
  - Observation: the coefficient $c_j$ of the term $x^{j-1} y$ is $\Pr(r(t_i) = j)$
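A minimal sketch of the and/xor-tree generating function, under these assumptions: the leaf of the target tuple contributes y, leaves with a higher score contribute x, all other leaves contribute 1; an and node multiplies its children; an xor node returns (1 - sum of child probabilities) plus the probability-weighted sum of its children. The tuple-based tree encoding and all function names are hypothetical, and the tree encodes the running example's rules R1 = {t1}, R2 = {t2, t4}, R3 = {t3}.

```python
from collections import defaultdict

SCORES = {"t1": 100, "t2": 85, "t3": 70, "t4": 60}

# Root and node over one xor node per rule; xor children are (probability, subtree).
TREE = ("and", [
    ("xor", [(0.4, ("leaf", "t1"))]),
    ("xor", [(0.5, ("leaf", "t2")), (0.5, ("leaf", "t4"))]),
    ("xor", [(1.0, ("leaf", "t3"))]),
])


def bi_mul(a, b):
    """Multiply bivariate polynomials stored as {(x_power, y_power): coeff}."""
    out = defaultdict(float)
    for (ax, ay), ac in a.items():
        for (bx, by), bc in b.items():
            out[(ax + bx, ay + by)] += ac * bc
    return dict(out)


def bi_add(a, b, scale=1.0):
    """Return a + scale * b for bivariate polynomials."""
    out = defaultdict(float, a)
    for key, c in b.items():
        out[key] += scale * c
    return dict(out)


def gen_func(node, target, scores):
    """Generating function of the subtree rooted at node, for the target tuple."""
    kind = node[0]
    if kind == "leaf":
        t = node[1]
        if t == target:
            return {(0, 1): 1.0}   # y marks the target tuple
        if scores[t] > scores[target]:
            return {(1, 0): 1.0}   # x marks a higher-scored tuple
        return {(0, 0): 1.0}       # lower-scored tuples do not affect the rank
    if kind == "and":
        result = {(0, 0): 1.0}
        for child in node[1]:
            result = bi_mul(result, gen_func(child, target, scores))
        return result
    # xor node: at most one child is present
    children = node[1]
    result = {(0, 0): 1.0 - sum(p for p, _ in children)}
    for p, child in children:
        result = bi_add(result, gen_func(child, target, scores), scale=p)
    return result


def rank_probability(tree, target, scores, j):
    """Pr(r(target) = j): the coefficient of x^(j-1) * y."""
    return gen_func(tree, target, scores).get((j - 1, 1), 0.0)
```

For t3 this gives about 0.3, 0.5, and 0.2 for ranks 1, 2, and 3, which agrees with the rank probabilities obtained from the possible worlds in the U-kRanks example.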

Summary
- Probabilistic top-k query
  - Different semantics w.r.t. ranks and probabilities in possible worlds
  - A unified approach