Efficient XML Stream Processing with Automata and Query Algebra

A Master Thesis Presentation
Student: Jinhui Jian
Advisor: Prof. Elke A. Rundensteiner
Reader: Prof. Kathi Fisler

The Need for XML Stream Processing
  • New paradigms
      - Distributed data provider
      - Distributed data consumer
  • New applications
      - Monitoring (e.g., sensor network)
      - Information filtering (e.g., news, email)
  • New challenges
      - Arbitrarily nested structure
      - Incomplete knowledge
[Figure: XML data streams from Internet sources (XML, relational, news, HTML) feed an XML stream processing engine]

Two Existing Approaches
  • Automata-based [xfilter 01, yfilter 02, x-scan 01, …]
  • Algebraic [tukwila 01, rainbow 02, …]
  • This thesis integrates both existing approaches into one system

A Running Example

<bib>
  <book year="1994">
    <title>TCP/IP Illustrated</title>
    <author><last>Stevens</last><first>W.</first></author>
    <publisher>Addison-Wesley</publisher>
    <price>65.95</price>
  </book>
  <book year="2000">
    <title>Data on the Web</title>
    <author><last>Abiteboul</last><first>Serge</first></author>
    <author><last>Buneman</last><first>Peter</first></author>
    <author><last>Suciu</last><first>Dan</first></author>
    <publisher>Morgan Kaufmann Publishers</publisher>
    <price>39.95</price>
  </book>
  <book year="1992">
    <title>Advanced Programming in the Unix environment</title>
    <author><last>Stevens</last><first>W.</first></author>
    <publisher>Addison-Wesley</publisher>
    <price>65.95</price>
  </book>
</bib>

Give me book titles whose price is greater than 50:

<result>
  FOR $b in doc(bib.xml)//book
  WHERE $b/price > 50
  RETURN <expensive> $b/title </expensive>
</result>

Result:

<result>
  <expensive> <title>TCP/IP Illustrated</title> </expensive>
  <expensive> <title>Advanced Programming in the Unix environment</title> </expensive>
</result>

XML as a Stream of Tokens
[Figure: tree view of the bib document with nodes bib, book, title, author, last, first, publisher, price, and their text nodes]
  • A token can be:
      - An open tag
      - A close tag
      - PCDATA
Input XML stream (along the timeline):
<bib> <book> <title> TCP/IP Illustrated </title> <author> <last> Stevens </last> … </book> …
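
The token model can be sketched in a few lines of Python. This is an illustration only, not the thesis implementation (the experimental setup lists the Xerces SAX parser for that job). The names TOKEN_RE and tokenize are made up for this sketch, and attributes, comments, and self-closing tags are deliberately ignored.

```python
import re

# One alternative matches a tag (optionally a close tag), the other a run of text.
TOKEN_RE = re.compile(r"<(/?)([A-Za-z_][\w.-]*)[^>]*>|([^<]+)")


def tokenize(xml):
    """Yield (kind, value) pairs where kind is 'open', 'close', or 'pcdata'."""
    for match in TOKEN_RE.finditer(xml):
        slash, name, text = match.groups()
        if name is not None:
            yield ("close" if slash else "open", name)
        elif text.strip():  # skip whitespace-only text between tags
            yield ("pcdata", text.strip())


stream = '<bib><book year="1994"><title>TCP/IP Illustrated</title></book></bib>'
tokens = list(tokenize(stream))
```

Because tokenize is a generator, a consumer sees each token as soon as it is recognized, which mirrors the token-at-a-time view of the stream.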

Basic State-Transition Model
FOR $b in doc(bib.xml)//book WHERE $b/price > 50 RETURN $b/title
Q := //book/price
[Figure: automaton for Q with states 0 to 3, a * self-loop, and transitions labeled ε, book, and price; below it, a trace of the active states (1; 1, 2; 1, 3; …) and of the stack of state sets as the tokens <bib> <book> <title> TCP/IP Illustrated </title> <price> 65.95 </price> … are read]
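
A stack-based run of such an automaton might look like the sketch below. The exact layout of the automaton is an assumption read off the figure residue (ε from state 0 to 1, a * self-loop on state 1, book from 1 to 2, price from 2 to 3); the names EPSILON, SELF_LOOP, STEP, FINAL, closure, and run are invented for this example. Open tags push the current state set and advance; close tags pop it again.

```python
# Assumed layout for Q := //book/price:
#   0 --ε--> 1,  1 --*--> 1 (self-loop),  1 --book--> 2,  2 --price--> 3 (final)
EPSILON = {0: {1}}
SELF_LOOP = {1}
STEP = {(1, "book"): 2, (2, "price"): 3}
FINAL = {3}


def closure(states):
    """Add every state reachable through ε transitions."""
    result, work = set(states), list(states)
    while work:
        for target in EPSILON.get(work.pop(), ()):
            if target not in result:
                result.add(target)
                work.append(target)
    return result


def run(tokens):
    """Return the active-state set after each open tag, plus the match count."""
    active, stack, trace, matches = closure({0}), [], [], 0
    for kind, value in tokens:
        if kind == "open":
            stack.append(active)  # remember the state set of the parent element
            nxt = {s for s in active if s in SELF_LOOP}
            nxt |= {STEP[(s, value)] for s in active if (s, value) in STEP}
            active = closure(nxt)
            trace.append(sorted(active))
            matches += bool(active & FINAL)
        elif kind == "close":
            active = stack.pop()  # backtrack to the parent's state set
    return trace, matches


tokens = [("open", "bib"), ("open", "book"), ("open", "title"),
          ("pcdata", "TCP/IP Illustrated"), ("close", "title"),
          ("open", "price"), ("pcdata", "65.95"), ("close", "price"),
          ("close", "book"), ("close", "bib")]
trace, matches = run(tokens)
```

Under these assumptions the active sets after <bib>, <book>, <title>, and <price> are {1}, {1, 2}, {1}, and {1, 3}, which agrees with the state sets that survive in the slide's trace.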

Extended with Data Buffer and Buffer Operations
FOR $b in doc(bib.xml)//book WHERE $b/price > 50 RETURN $b/title
[Figure: automaton with states 0 to 4 and transitions ε, book, title, and price, extended with a buffer and a flag]
Operations on the title branch:
  1. Write buffer
  2. Output if flag is set
Operations on the price branch:
  1. Evaluate the predicate and set/clear flag
  2. Output if buffer not empty
  • Data-driven
      - Token at a time
      - Fixed order
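
The buffer and flag operations can be illustrated with the following sketch. It is a simplification under stated assumptions: a plain path stack stands in for the automaton states, the buffer and flag are reset at every <book> open tag, and expensive_titles and its threshold parameter are hypothetical names, not part of the original system.

```python
def expensive_titles(tokens, threshold=50.0):
    """Emit titles of books whose price exceeds threshold, token at a time."""
    path = []      # open elements, standing in for the automaton states
    buffer = []    # titles seen before the predicate was decided
    flag = None    # None = undecided, True/False = predicate result
    out = []
    for kind, value in tokens:
        if kind == "open":
            path.append(value)
            if value == "book":          # new binding: reset buffer and flag
                buffer, flag = [], None
        elif kind == "close":
            path.pop()
        elif path[-2:] == ["book", "title"]:
            if flag:                     # predicate already true: output now
                out.append(value)
            elif flag is None:           # undecided: write buffer
                buffer.append(value)
        elif path[-2:] == ["book", "price"]:
            flag = float(value) > threshold   # evaluate predicate, set/clear flag
            if flag:
                out.extend(buffer)            # output if buffer not empty
            buffer = []
    return out


def book(*children):
    """Build the token list for one <book> from (tag, text) pairs."""
    toks = [("open", "book")]
    for tag, text in children:
        toks += [("open", tag), ("pcdata", text), ("close", tag)]
    return toks + [("close", "book")]


stream = ([("open", "bib")]
          + book(("title", "A"), ("price", "65.95"))   # title before price
          + book(("price", "39.95"), ("title", "B"))   # fails the predicate
          + book(("price", "60"), ("title", "C"))      # price before title
          + [("close", "bib")])
result = expensive_titles(stream)
```

The order in which title and price arrive decides whether a title is buffered or emitted directly, which is the sense in which this model is data-driven with a fixed order.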

Algebraic Query Plan
FOR $b in doc(bib.xml)//book WHERE $b/price > 50 RETURN $b/title
Plan, from top to bottom:
  Tagger
  Navigate //book, title
  Select price > 50
  Navigate //book, price
  Extract //book
  • Set at a time
  • Postponed operation

Exploit the Flexibility of Postponed Operations
FOR $b in doc(bib.xml)//book WHERE $b/price > 50 and $b/author/last = “Stevens” RETURN $b/title
Plan, from top to bottom:
  Tagger
  Navigate //book, title
  Select last = “Stevens”
  Navigate //book, author/last
  Select price > 50
  Navigate //book, price
  Extract //book
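
A set-at-a-time version of this plan can be sketched with Python's standard xml.etree.ElementTree module. The operator functions extract, navigate, select, and tagger, the helper run_plan, and the column names $p, $l, $t are all invented for this illustration; the data is a shortened form of the running example. The point of the sketch is that each Navigate is evaluated only when the plan reaches it, so the two selections can be applied in either order.

```python
import xml.etree.ElementTree as ET

BIB = """<bib>
  <book><title>TCP/IP Illustrated</title>
        <author><last>Stevens</last></author><price>65.95</price></book>
  <book><title>Data on the Web</title>
        <author><last>Abiteboul</last></author><price>39.95</price></book>
  <book><title>Advanced Programming in the Unix environment</title>
        <author><last>Stevens</last></author><price>65.95</price></book>
</bib>"""


def extract(root, tag, var):
    """One tuple per matching element, bound to variable var."""
    return [{var: elem} for elem in root.iter(tag)]


def navigate(tuples, var, path, out):
    """Add column out holding the text of the elements reached via path."""
    return [{**t, out: [(e.text or "").strip() for e in t[var].findall(path)]}
            for t in tuples]


def select(tuples, col, pred):
    """Keep tuples in which some value of column col satisfies pred."""
    return [t for t in tuples if any(pred(v) for v in t[col])]


def tagger(tuples, col, tag):
    """Rebuild XML text around each value of column col."""
    return ["<%s><title>%s</title></%s>" % (tag, v, tag)
            for t in tuples for v in t[col]]


def run_plan(root, price_first=True):
    steps = [("price", "$p", lambda v: float(v) > 50),
             ("author/last", "$l", lambda v: v == "Stevens")]
    if not price_first:
        steps.reverse()              # same answer, different evaluation order
    tuples = extract(root, "book", "$b")
    for path, col, pred in steps:
        tuples = select(navigate(tuples, "$b", path, col), col, pred)
    return tagger(navigate(tuples, "$b", "title", "$t"), "$t", "expensive")


root = ET.fromstring(BIB)
answer = run_plan(root)
```

Which order is cheaper depends on how selective each predicate is, which is exactly the kind of decision an optimizer can make once the operations are postponed.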

Query Optimization in Algebraic Systems
  • Logical optimization
      - Selection pushdown
      - Projection pushdown
      - Join order selection
  • Physical optimization
      - Operator algorithms
  • Runtime optimization
      - Scheduling
      - Resource allocation

Thesis Overview
  • Motivation
      - The automata model is good for on-the-fly pattern matching/retrieval
      - The algebraic model is good for optimizing complex queries
  • Major challenges
      - How to integrate the two models?
      - How to optimize a query within the integrated query model?

The Raindrop Approach
  • Integration
  • Optimization

Path Bindings in XQuery
  • FLWR expression: FOR … LET … WHERE … RETURN …
      - FOR … LET: path bindings
      - WHERE … RETURN: filtering and restructuring
“The purpose of path bindings is to produce a tuple stream in which each tuple consists of one or more bound variables” [W3C]

FOR $b in doc(bib.xml)//book
WHERE $b/price > 50
RETURN $b/title

The same query with explicit path bindings:

FOR $b in doc(bib.xml)//book
LET $p := $b/price, $t := $b/title
WHERE $p > 50
RETURN $t

A Two-Tier System Architecture
[Figure: the XML data stream enters the automata plan, which passes a tuple stream to the master plan, which produces the query answer]

Modeling the Master Plan: Algebraic
Plan, from top to bottom:
  Tagger
  Navigate //book, title
  Select last = …
  Navigate //book, author/last
  Select price > 50
  Navigate //book, price

Modeling the Automata Plan: Black Box vs. White Box
Black box: a single automata plan that evaluates
  Q1 := //book
  Q2 := //book/price
  Q3 := //book/title
White box: SJoin //book over Extract //book/price and Extract //book/title

How to optimize it?
[Figure: the two-tier architecture again: XML data stream, automata plan, tuple stream, master plan, query answer]

Optimization: A Unified Process in the Logical View
[Figure: the master plan (Navigate //book, title; Select price > 50; Navigate //book, price) stacked on the automata plan (Extract //book, with an automaton of states 0 to 2 and transitions ε and book), shown together in one logical view with a sample tuple table over the variables $a, $b, $c]

The Algebra Core

Op                Semantics
Selection         Filter tuples based on the predicate pred
Projection        Filter columns in the input tuples based on the variable list v
Join              Join input tuples based on the predicate pred
Aggregate         Aggregate over input tuples with the aggregate function f, e.g., sum and average
Tagger            Format outputs based on the pattern pt, i.e., reconstruct XML tags
Navigate          Take input elements of path p1 and output ancestor elements of path p2
Extract           Identify elements of path p from the input stream
Structural Join   Join input tuples on their structural relationship, e.g., the common parent relationship p

The Extract Operator
Operator: Extract //book/title
Input stream: <bib> <book> <title> TCP/IP Illustrated </title> … </book> …
Output:
  <title>TCP/IP Illustrated</title>
  <title>Data on the Web</title>
  <title>Advanced Programming in the Unix environment</title>
[Figure: automaton for //book/title with a * self-loop and transitions ε, book, and title]

The Structural Join Operator
FOR $b in doc(bib.xml)//book LET $p := $b/price, $t := $b/title WHERE $p > 50 RETURN $t
Plan: SJoin //book over Extract //book/title and Extract //book/price
Input stream: <bib> <book> <title> TCP/IP Illustrated </title> … </book> …
[Figure: automaton with states 0 to 4 and transitions ε, book, title, and price; the extracted <title>…</title> and <price>…</price> elements flow into SJoin //book, which pairs those that share a book parent]
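
The common-parent join can be sketched as follows. Tagging each extracted value with the sequence number of its enclosing book is an assumption made for this example (the slides do not say how Raindrop identifies the shared parent), non-nested book elements are assumed, and extract and sjoin are illustrative names.

```python
def extract(tokens, parent, child):
    """Return (parent_seq, text) for each child text directly under a parent."""
    pairs, path, seq = [], [], -1
    for kind, value in tokens:
        if kind == "open":
            path.append(value)
            if value == parent:
                seq += 1                 # a new parent element starts
        elif kind == "close":
            path.pop()
        elif path[-2:] == [parent, child]:
            pairs.append((seq, value))
    return pairs


def sjoin(left, right):
    """Pair left and right values that carry the same parent sequence number."""
    by_parent = {}
    for pid, value in right:
        by_parent.setdefault(pid, []).append(value)
    return [(lv, rv) for pid, lv in left for rv in by_parent.get(pid, [])]


def elem(tag, text):
    return [("open", tag), ("pcdata", text), ("close", tag)]


tokens = ([("open", "bib"), ("open", "book")]
          + elem("title", "TCP/IP Illustrated") + elem("price", "65.95")
          + [("close", "book"), ("open", "book")]
          + elem("title", "Data on the Web") + elem("price", "39.95")
          + [("close", "book"), ("close", "bib")])

titles = extract(tokens, "book", "title")
prices = extract(tokens, "book", "price")
joined = sjoin(titles, prices)
expensive = [t for t, p in joined if float(p) > 50]
```

Each joined pair corresponds to one ($t, $p) tuple of the query, so the WHERE clause becomes an ordinary selection over the join result.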

The Navigate Operator
Operator: Navigate //book, title
Input: <book>… …</book>
Output: <title>…</title>
  • A navigate operation can be postponed, independent of the input stream
Example input element:
<book year="1994">
  <title>TCP/IP Illustrated</title>
  <author> <last>Stevens</last> <first>W.</first> </author>
  <publisher>Addison-Wesley</publisher>
  <price>65.95</price>
</book>

A Special Optimization: In or Out?
[Figure: the two-tier architecture again: XML data stream, automata plan, tuple stream, master plan, query answer]

Two Options: Bottom-up vs. Top-down
[Figure: two ways to obtain the title and price of a book. In one, <title>…</title> and <price>…</price> elements are produced separately and combined into their <book>… …</book>. In the other, a whole <book> element is produced first and <title>…</title> and <price>…</price> are taken from it.]
Example book element:
<book year="1994">
  <title>TCP/IP Illustrated</title>
  <author> <last>Stevens</last> <first>W.</first> </author>
  <publisher>Addison-Wesley</publisher>
  <price>65.95</price>
</book>

Exploiting the Options for Optimization
The push-in plan, from top to bottom:
  Tagger
  Select //book/price > 50
  SJoin //book
  Extract //book/title, Extract //book/price
  [automaton with states 0 to 4 and transitions ε, book, title, and price]
The pull-out plan, from top to bottom:
  Tagger
  Navigate //book, title
  Select price > 50
  Navigate //book, price
  Extract //book
  [automaton with states 0 to 2 and transitions ε and book]

Query Optimization by Rewriting Rules
  • Navigate push-in
  • Redundant SJoin
  • Redundant Extract
  • Selection pushdown
  • Etc.
  • Algebraic transformation

Runtime Optimization: Why?
  • Optimization relies on cost estimation, which in turn relies on statistics
      - Statistics unknown
      - Statistics change
Example plan, from top to bottom:
  Tagger
  Navigate //book, title
  Select price > 50
  Navigate //book, price
  Extract //book

Runtime Optimization Steps
  1. Stat collection
  2. Decision making
  3. Plan migration

Why Need Migration?
[Figure: the migration process, showing the executor and the optimizer over one optimization cycle. Legend: normal execution, prepare for migration, decision making, plan modification.]
  • When to interrupt the executor
      - Master plan
      - Automata plan

Modifying the Automata: A Bad Example
[Figure: the push-in plan (Select //book/price > 50; SJoin //book; Extract //book/title, Extract //book/price; automaton with states 0 to 4) next to the pull-out plan (Navigate //book, //book/title; Select //book/price > 50; Navigate //book, //book/price; Extract //book; automaton with states 0 to 2)]
Input stream: <bib> <book> <title> TCP/IP Illustrated </title> <price> 36.65 </price> … </book> … <book>

Modifying the Automata: A Safe Approach
FOR $b in doc(bib.xml)//book LET $p := $b/price, $t := $b/title WHERE $p > 50 RETURN $t
[Figure: the automaton with states 0 to 2 (transitions ε and book) and the automaton with states 0 to 4 (transitions ε, book, title, and price)]
Input stream, with a safe point and an unsafe point marked:
<bib> <note>…</note> <book> … </book> <book>…</book> <note>…</note> …
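
One way to read the safe/unsafe distinction is sketched below. The criterion used here is an assumption, since the slide does not spell it out: a point in the stream is treated as safe when no book element is open, so that no partial match or buffered data depends on the states being replaced. The function name safe_points and its pattern_root parameter are hypothetical.

```python
def safe_points(tokens, pattern_root="book"):
    """For each token, report whether the automaton could be swapped after it."""
    depth = 0                      # number of currently open pattern_root elements
    flags = []
    for kind, value in tokens:
        if kind == "open" and value == pattern_root:
            depth += 1
        elif kind == "close" and value == pattern_root:
            depth -= 1
        flags.append(depth == 0)   # assumed criterion: no book in progress
    return flags


tokens = [("open", "bib"),
          ("open", "note"), ("close", "note"),
          ("open", "book"), ("open", "title"), ("close", "title"),
          ("close", "book"),
          ("open", "note"), ("close", "note"),
          ("close", "bib")]
flags = safe_points(tokens)
```

Under this reading, every point inside <book> … </book> is unsafe, while the points around the <note> elements and between two books are safe.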

Experimental Study
  • Is it feasible to integrate the automata model and the algebraic model?
  • Is push-in vs. pull-out a feasible optimization?
  • Is runtime optimization worthwhile?

Experimental Setup
  • Java 1.4
  • Pentium III 750 MHz, 384 MB
  • Windows XP Professional
  • Third-party components
      - Xerces SAX parser
      - The Kweelt XQuery parser
      - Rainbow core

Exp 1: System Throughput

Exp 2: Push-in vs. Pull-out

Exp 3: Runtime Optimization

Related Work
  • Automata-based XML processing
      - XFilter, YFilter, X-Scan, XTrie, XPush, …
  • Algebraic XQuery engines
      - XPeranto, LegoDB, Rainbow, Timber, …
  • Runtime optimization
      - Tukwila, TelegraphCQ, …

Contribution
  • While much recent XML stream work (e.g., in SIGMOD 03) processes XPath queries, we are among the first to deal with XQuery
  • We are the first to consider the flexible automata and query algebra integration problem
  • Push-in vs. pull-out optimization techniques
  • Prototype system
  • Experimental study

Conclusion
  • Combining automata and query algebra results in a very powerful query model for XML stream processing
  • Special optimization techniques (e.g., push-in vs. pull-out) can be applied in the integrated system
  • Data statistics collected at runtime can be exploited via runtime optimization techniques

Thanks to:
  • Prof. Elke A. Rundensteiner
  • Prof. Kathi Fisler
  • The Raindrop/Rainbow team
  • All DSRG members

Questions?