Cooperative Query Answering for Semistructured Data
By Michael Barg and Raymond K. Wong
Speakers: Chuan Lin & Xi Zhang

Outline
- Motivations
- Overview
- Basic Concepts
- Cooperative Query Processing
- Experiment

Motivations
- XML data
  - Same semantic content
  - Very different structures

Example: same semantics, different structures
- User query: “insurance claims” related to “smoking” for “woman”
- Court Transcript (figure): nodes plaintiff, woman, insurance claim, smoking
- Insurance Record (figure): nodes insurance claim, insurer, woman, smoking

Motivations
- No exact query result
- User query:
  - “phone number” of “Bob”
  - Who is the new “sales manager”?
- Data (figure): a personnel tree with nodes sales manager, Joe, salesman, assistant salesman, sales manager, Bob, phone number

Overview
- Goal:
  - Return approximate answers for XML queries
  - “Approximate”: semantically and structurally similar
- Solution:
  - Return a set of results ranked by an overall score
  - The overall score indicates how well the subgraph containing the result satisfies the query criteria

Basic Concepts: Query Tree
- Query: /restaurant[.//Soho]/phone_number
- Query tree (figure): nodes soho, restaurant, and phone_number, with phone_number as the result term; each edge end is labelled h (head) or t (tail)
- For each edge:
  - “head”: the end which is closer to the nearest result term
  - “tail”: the other end
  - In case of a tie, the “head” is the end closer to the root
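
The head/tail rule can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function names (bfs_distances, orient_edges) and the edge-list representation of the query tree are assumptions made for this example, and "closer" is read as "fewer edges away".

```python
from collections import deque


def bfs_distances(adj, start):
    """Edge-count distance from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def orient_edges(edges, root, result_terms):
    """Return {(u, v): (head, tail)} for every query-tree edge (hypothetical helper)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    depth = bfs_distances(adj, root)
    per_term = [bfs_distances(adj, term) for term in result_terms]
    # Distance from each node to its nearest result term.
    nearest = {node: min(d[node] for d in per_term) for node in adj}
    oriented = {}
    for u, v in edges:
        if nearest[u] != nearest[v]:
            head = u if nearest[u] < nearest[v] else v
        else:
            # Tie: the head is the end closer to the root.
            head = u if depth[u] < depth[v] else v
        tail = v if head == u else u
        oriented[(u, v)] = (head, tail)
    return oriented


# Query tree for /restaurant[.//Soho]/phone_number
edges = [("restaurant", "soho"), ("restaurant", "phone_number")]
oriented = orient_edges(edges, root="restaurant", result_terms=["phone_number"])
for edge, (head, tail) in oriented.items():
    print(edge, "head:", head, "tail:", tail)
```

For the example query, restaurant is the head of the restaurant-soho edge and phone_number is the head of the restaurant-phone_number edge, since phone_number is the only result term.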

Basic Concepts: Converging Order
- Order of edges considered in query processing
- Converge on a result term

Basic Concepts: Similarity
- Semantically similar topologies
- Figure: five topologies, (a) to (e), relating restaurant and soho; other node labels are shopping_center, address, and eating_places

Basic Concepts: Similarity (cont.)
- Deviation Proximity (DP)
  - Measures how far one structure deviates from a desired structure
  - Given:
    - ra: data node with value a
    - rb: data node with value b
    - Q(a, b): query tree edge
  - DP: the distance from the actual position of rb to the nearest position, r’b, which satisfies the topological relationship specified by Q(a, b)
    - Topological relationship: parent-child, ancestor-descendant

Deviation Proximity
- Q(restaurant, soho) requires a parent-child relationship
- Figure: the five topologies, each with the nearest satisfying position marked soho’
- DP(restaurant, soho) values shown: 0, 1, 2, 3, 3

Deviation Proximity
- Q(restaurant, soho) requires an ancestor-descendant relationship
- Figure: the five topologies, each with the nearest satisfying position marked soho’
- DP(restaurant, soho) values shown: 0, 0, 2, 3, 3
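
One possible reading of the DP definition is sketched below in Python. The sketch assumes DP is the number of edges between rb and the nearest position r’b that would satisfy the relationship, where r’b may be a hypothetical child slot under ra. The function names and the five example trees are assumptions: the trees are built from the node labels on the slides, but their exact shapes are guesses.

```python
def ancestors(parent, node):
    """Return [node, parent(node), ..., root] using a child -> parent map."""
    chain = [node]
    while parent.get(chain[-1]) is not None:
        chain.append(parent[chain[-1]])
    return chain


def tree_distance(parent, x, y):
    """Number of edges on the path between x and y in the same tree."""
    up_y = {n: i for i, n in enumerate(ancestors(parent, y))}
    for i, n in enumerate(ancestors(parent, x)):
        if n in up_y:  # the first shared ancestor is the lowest one
            return i + up_y[n]
    raise ValueError("nodes are not in the same tree")


def deviation_proximity(parent, ra, rb, relation):
    """DP of rb relative to ra under the assumed reading (hypothetical helper)."""
    up_b = ancestors(parent, rb)
    if ra in up_b[1:]:  # rb already lies somewhere below ra
        if relation == "ancestor-descendant":
            return 0
        # Parent-child: the nearest valid slot is rb's ancestor directly under ra.
        return up_b.index(ra) - 1
    # Otherwise walk from rb to ra, then one step down to a child slot of ra.
    return tree_distance(parent, rb, ra) + 1


# Assumed shapes for five topologies (child -> parent maps, root maps to None).
topologies = {
    "direct child": {"restaurant": None, "soho": "restaurant"},
    "grandchild": {"restaurant": None, "address": "restaurant", "soho": "address"},
    "siblings": {"shopping_center": None, "restaurant": "shopping_center",
                 "soho": "shopping_center"},
    "reversed": {"soho": None, "restaurant": "soho"},
    "reversed, one level apart": {"soho": None, "eating_places": "soho",
                                  "restaurant": "eating_places"},
}

pc = [deviation_proximity(t, "restaurant", "soho", "parent-child")
      for t in topologies.values()]
ad = [deviation_proximity(t, "restaurant", "soho", "ancestor-descendant")
      for t in topologies.values()]
print(sorted(pc), sorted(ad))
```

Under these assumptions the five trees give the values 0, 1, 2, 3, 3 for the parent-child case and 0, 0, 2, 3, 3 for the ancestor-descendant case, the same sets of values that appear on the two slides. Which value belongs to which topology on the slides cannot be recovered from the text.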

Cooperative Query Processing
- Input: a Query Tree QT, an XML Document Tree DT
- Output: ordered list of <r_result_term, score>
- Cooperative Query Processing
  - Structural proximity calculation
  - Progressive score

Cooperative Query Processing (cont.)
- Progressively match edges in QT with DT
  - Consider edges in converging order
  - For each edge QT(a, b), where a is the head and b is the tail, get a list of <ra, score>
    - ra is a node in DT with value a
    - score is the progressive score of ra w.r.t. the nearest rb
    - Use graph encoding to calculate the structural proximity of ra and rb

Structural Proximity Calculation
- Encodings and Compressed Arrays
  - Compact
  - Preserve relationship to a larger graph
  - Facilitate distance calculations
- Proximity Searching

Encodings and Compressed Arrays
- Basic concepts:
  - Common Node
  - Terminal Node
  - Annotated Node
- Path representation
  - Representing Single Path
  - Representing Multiple Paths
  - Representing Multiple Elements
- Compressed Arrays
  - Each encoding is a path/multi-path for a node/a set of nodes

Encodings and Compressed Arrays

Representing Single Path
- 1.1.1 y1
- 1.2.1.1 y2
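
A minimal Python sketch of how such single-path strings could be produced is given below. It assumes, without confirmation from the slides, that each number is the 1-based position of a node among its siblings and that the root is written as 1. The function name path_encodings, the children map, and the intermediate node names (n1, n2, n3) are hypothetical.

```python
def path_encodings(children, root):
    """Map each node to a dot-separated string of 1-based child ordinals (assumed scheme)."""
    encodings = {root: "1"}
    stack = [root]
    while stack:
        node = stack.pop()
        for position, child in enumerate(children.get(node, []), start=1):
            encodings[child] = f"{encodings[node]}.{position}"
            stack.append(child)
    return encodings


# Hypothetical tree in which y1 and y2 sit at the positions named on the slide.
children = {
    "root": ["n1", "n2"],
    "n1": ["y1"],
    "n2": ["n3"],
    "n3": ["y2"],
}
enc = path_encodings(children, "root")
print(enc["y1"])  # 1.1.1
print(enc["y2"])  # 1.2.1.1
```

The sketch covers only a tree with one path per element; the multi-path and multi-element encodings with annotated nodes are not modelled here.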

Representing Multiple Paths
- 1.3 B. B.2.1.1 C.3 C. C.2 y3

Representing Multiple Elements
- 1 A. A.1.1 y1 .2.1.1 y2 .3 B. B.2.1.1 C.3 C. C.2 y3

Compressed Arrays

Drawback of Encoding
- 1 A. A.1 B. B.1 D .2 E .? .2 C. C.1 F .2 G

Proximity Searching
- Multi-Element Comparison
  - Input:
    - A compressed array, ca.N, containing the multi-element encoding of the Near Set
    - A compressed array, ca.F, containing the multi-path encoding or path encoding of all paths from the root to the specified element of the Find Set, EF
  - Output:
    - dist, the length of the shortest path from EF to the closest element in the Near Set
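
The input/output contract above can be illustrated with a naive Python sketch. It is not the compressed-array comparison from the talk: it assumes a plain tree in which every element has one dot-separated path encoding (child ordinals from the root), and it scans the Near Set one element at a time. The names encoding_distance and min_dist, and the example encodings, are hypothetical.

```python
def encoding_distance(enc_a, enc_b):
    """Edges between two tree nodes, via the longest common prefix of their paths."""
    a = enc_a.split(".")
    b = enc_b.split(".")
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    # Climb from each node to the lowest common ancestor.
    return (len(a) - common) + (len(b) - common)


def min_dist(find_encoding, near_encodings):
    """Shortest distance from the Find element to any Near Set element."""
    return min(encoding_distance(find_encoding, e) for e in near_encodings)


near_set = ["1.2.1.1", "1.1.2", "1.3"]
dist = min_dist("1.1.1", near_set)
print(dist)  # 2, reached via the sibling at 1.1.2
```

A multi-element encoding shares common prefixes across the whole Near Set, so a real implementation could avoid re-reading them for every element; the linear scan here only shows what dist means.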

Proximity Searching
- Figure: search example showing Min.Dist = 5, Min.Dist = 4, Min.Dist = 2

Progressive Score
- Accumulative Deviation Proximity (DP)
  - Calculated from structural proximity
- Boolean operator at Query Tree branches
  - Figure: two branches, each with node a above children b and c, one per operator
  - prog(a) = prog(b) + prog(c)
  - prog(a) = min(prog(b), prog(c))
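
The two combination rules can be written as a short recursive function. In this Python sketch the sum is assumed to belong to an AND branch and the minimum to an OR branch; the slide text does not say which operator goes with which formula, so that pairing is a guess. The QueryNode class and its field names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class QueryNode:
    label: str
    score: int = 0          # progressive score of a leaf (for example, its DP)
    op: str = "and"         # assumed: "and" -> sum, "or" -> min
    children: list = field(default_factory=list)


def prog(node):
    """Progressive score of a query-tree node under the assumed AND/OR pairing."""
    if not node.children:
        return node.score
    child_scores = [prog(child) for child in node.children]
    if node.op == "and":
        return sum(child_scores)   # prog(a) = prog(b) + prog(c)
    return min(child_scores)       # prog(a) = min(prog(b), prog(c))


b = QueryNode("b", score=2)
c = QueryNode("c", score=3)
and_branch = QueryNode("a", op="and", children=[b, c])
or_branch = QueryNode("a", op="or", children=[b, c])
print(prog(and_branch), prog(or_branch))  # 5 2
```

Lower scores mean less deviation, which fits the pairing used here: an OR branch is satisfied by its best child, while an AND branch pays for every child.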

Experiment
- Query: //restaurant/soho
- Query result: <soho, 2> <soho, 3> <soho, 4>
- XML: [figure: XML document]

Thank you!

Questions & Answers