Parsing for XML Developers Roger L. Costello 28 September 2014


Flat XML Document
You might receive an XML document that has no structure. For example, this XML document contains a flat (linear) list of Book data:

<Books>
  <Title>Parsing Techniques</Title>
  <Author>Dick Grune</Author>
  <Author>Ceriel J. H. Jacobs</Author>
  <Date>2007</Date>
  <ISBN>978-0-387-20248-8</ISBN>
  <Publisher>Springer</Publisher>
  <Title>Introduction to Graph Theory</Title>
  <Author>Richard J. Trudeau</Author>
  <Date>1993</Date>
  <ISBN>0-486-67870-9</ISBN>
  <Publisher>Dover Publications</Publisher>
  <Title>Introduction to Formal Languages</Title>
  <Author>Gyorgy E. Revesz</Author>
  <Date>2012</Date>
  <ISBN>0-486-66697-2</ISBN>
  <Publisher>Dover Publications</Publisher>
</Books>

Give it structure to facilitate processing

Flat:

<Books>
  <Title>Parsing Techniques</Title>
  <Author>Dick Grune</Author>
  <Author>Ceriel J. H. Jacobs</Author>
  <Date>2007</Date>
  <ISBN>978-0-387-20248-8</ISBN>
  <Publisher>Springer</Publisher>
  <Title>Introduction to Graph Theory</Title>
  <Author>Richard J. Trudeau</Author>
  <Date>1993</Date>
  <ISBN>0-486-67870-9</ISBN>
  <Publisher>Dover Publications</Publisher>
  <Title>Introduction to Formal Languages</Title>
  <Author>Gyorgy E. Revesz</Author>
  <Date>2012</Date>
  <ISBN>0-486-66697-2</ISBN>
  <Publisher>Dover Publications</Publisher>
</Books>

Structured:

<Books>
  <Book>
    <Title>Parsing Techniques</Title>
    <Authors>
      <Author>Dick Grune</Author>
      <Author>Ceriel J. H. Jacobs</Author>
    </Authors>
    <Date>2007</Date>
    <ISBN>978-0-387-20248-8</ISBN>
    <Publisher>Springer</Publisher>
  </Book>
  <Book>
    <Title>Introduction to Graph Theory</Title>
    <Authors>
      <Author>Richard J. Trudeau</Author>
    </Authors>
    <Date>1993</Date>
    <ISBN>0-486-67870-9</ISBN>
    <Publisher>Dover Publications</Publisher>
  </Book>
  <Book>
    <Title>Introduction to Formal Languages</Title>
    <Authors>
      <Author>Gyorgy E. Revesz</Author>
    </Authors>
    <Date>2012</Date>
    <ISBN>0-486-66697-2</ISBN>
    <Publisher>Dover Publications</Publisher>
  </Book>
</Books>

That’s parsing!
Parsing is taking a flat (linear) sequence of items and adding structure so that the result conforms to a grammar.

Parsing

<Books>
  <Title>Parsing Techniques</Title>
  <Author>Dick Grune</Author>
  <Author>Ceriel J. H. Jacobs</Author>
  <Date>2007</Date>
  <ISBN>978-0-387-20248-8</ISBN>
  <Publisher>Springer</Publisher>
  <Title>Introduction to Graph Theory</Title>
  <Author>Richard J. Trudeau</Author>
  <Date>1993</Date>
  <ISBN>0-486-67870-9</ISBN>
  <Publisher>Dover Publications</Publisher>
  <Title>Introduction to Formal Languages</Title>
  <Author>Gyorgy E. Revesz</Author>
  <Date>2012</Date>
  <ISBN>0-486-66697-2</ISBN>
  <Publisher>Dover Publications</Publisher>
</Books>

parse

<Books>
  <Book>
    <Title>Parsing Techniques</Title>
    <Authors>
      <Author>Dick Grune</Author>
      <Author>Ceriel J. H. Jacobs</Author>
    </Authors>
    <Date>2007</Date>
    <ISBN>978-0-387-20248-8</ISBN>
    <Publisher>Springer</Publisher>
  </Book>
  <Book>
    <Title>Introduction to Graph Theory</Title>
    <Authors>
      <Author>Richard J. Trudeau</Author>
    </Authors>
    <Date>1993</Date>
    <ISBN>0-486-67870-9</ISBN>
    <Publisher>Dover Publications</Publisher>
  </Book>
  <Book>
    <Title>Introduction to Formal Languages</Title>
    <Authors>
      <Author>Gyorgy E. Revesz</Author>
    </Authors>
    <Date>2012</Date>
    <ISBN>0-486-66697-2</ISBN>
    <Publisher>Dover Publications</Publisher>
  </Book>
</Books>

From the book: “Parsing Techniques”
• Parsing is the process of structuring a linear representation in accordance with a given grammar.
• The “linear representation” may be:
  • a flat sequence of XML elements
  • a sentence
  • a computer program
  • a knitting pattern
  • a sequence of geological strata
  • a piece of music
  • actions of ritual behavior

Grammar
• A grammar is a succinct description of the structure.
• Here is a grammar for Books:

  Books → Book+
  Book → Title Authors Date ISBN Publisher
  Authors → Author+
  Title → text
  Author → text
  Date → text
  ISBN → text
  Publisher → text
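One way to make such a grammar concrete is to write it down as data and check a document against it. The Python sketch below is only an illustration and is not part of the original slides: the `GRAMMAR` dictionary, the trailing `+` convention for "one or more", and the `conforms` function are names assumed for this example.

```python
import xml.etree.ElementTree as ET

# Each rule maps an element name to the sequence of children it expects.
# A trailing "+" means "one or more"; ["text"] means character data only.
GRAMMAR = {
    "Books": ["Book+"],
    "Book": ["Title", "Authors", "Date", "ISBN", "Publisher"],
    "Authors": ["Author+"],
    "Title": ["text"],
    "Author": ["text"],
    "Date": ["text"],
    "ISBN": ["text"],
    "Publisher": ["text"],
}


def conforms(element):
    """Return True if element and all of its descendants follow GRAMMAR."""
    rule = GRAMMAR.get(element.tag)
    if rule is None:
        return False  # element name not described by the grammar
    children = list(element)
    if rule == ["text"]:
        return not children  # a text rule allows no child elements
    pos = 0
    for symbol in rule:
        name = symbol.rstrip("+")
        start = pos
        # Consume one child, or a whole run of children for a "+" symbol.
        while pos < len(children) and children[pos].tag == name:
            pos += 1
            if not symbol.endswith("+"):
                break
        if pos == start:
            return False  # the expected child is missing
    return pos == len(children) and all(conforms(c) for c in children)


structured = ET.fromstring(
    "<Books><Book><Title>Parsing Techniques</Title>"
    "<Authors><Author>Dick Grune</Author></Authors>"
    "<Date>2007</Date><ISBN>978-0-387-20248-8</ISBN>"
    "<Publisher>Springer</Publisher></Book></Books>"
)
flat = ET.fromstring(
    "<Books><Title>Parsing Techniques</Title>"
    "<Author>Dick Grune</Author><Date>2007</Date>"
    "<ISBN>978-0-387-20248-8</ISBN><Publisher>Springer</Publisher></Books>"
)
print(conforms(structured), conforms(flat))
```

The structured document satisfies every rule, while the flat one fails at the first rule because Books has no Book children.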

Parsing

Linear representation

<Books>
  <Title>Parsing Techniques</Title>
  <Author>Dick Grune</Author>
  <Author>Ceriel J. H. Jacobs</Author>
  <Date>2007</Date>
  <ISBN>978-0-387-20248-8</ISBN>
  <Publisher>Springer</Publisher>
  <Title>Introduction to Graph Theory</Title>
  <Author>Richard J. Trudeau</Author>
  <Date>1993</Date>
  <ISBN>0-486-67870-9</ISBN>
  <Publisher>Dover Publications</Publisher>
  <Title>Introduction to Formal Languages</Title>
  <Author>Gyorgy E. Revesz</Author>
  <Date>2012</Date>
  <ISBN>0-486-66697-2</ISBN>
  <Publisher>Dover Publications</Publisher>
</Books>

Grammar

  Books → Book+
  Book → Title Authors Date ISBN Publisher
  Authors → Author+
  Title → text
  Author → text
  Date → text
  ISBN → text
  Publisher → text

parser

Structured representation

<Books>
  <Book>
    <Title>Parsing Techniques</Title>
    <Authors>
      <Author>Dick Grune</Author>
      <Author>Ceriel J. H. Jacobs</Author>
    </Authors>
    <Date>2007</Date>
    <ISBN>978-0-387-20248-8</ISBN>
    <Publisher>Springer</Publisher>
  </Book>
  <Book>
    <Title>Introduction to Graph Theory</Title>
    <Authors>
      <Author>Richard J. Trudeau</Author>
    </Authors>
    <Date>1993</Date>
    <ISBN>0-486-67870-9</ISBN>
    <Publisher>Dover Publications</Publisher>
  </Book>
  <Book>
    <Title>Introduction to Formal Languages</Title>
    <Authors>
      <Author>Gyorgy E. Revesz</Author>
    </Authors>
    <Date>2012</Date>
    <ISBN>0-486-66697-2</ISBN>
    <Publisher>Dover Publications</Publisher>
  </Book>
</Books>

Parsing Techniques
• Over the last 50 years many parsing techniques have been created.
• Some parsing techniques work from the starting grammar rule down to the bottom grammar rules. These are called top-down parsing techniques.
• Other parsing techniques work from the bottom grammar rules up to the starting grammar rule. These are called bottom-up parsing techniques.
• The following slides show how to apply a powerful bottom-up parsing technique to the Books example.

What does “powerful” mean?
• The previous slide said, “… following slides show how to apply a powerful bottom-up parsing technique …”
• “Powerful” means the technique can be used with lots of grammars, i.e., it can be used to generate lots of different structures.

Suppose we were to structure the XML from scratch. We might follow these steps:

<Books>
</Books>

<Books>
  <Book>
    <Title>Parsing Techniques</Title>
  </Book>
</Books>

<Books>
  <Book>
    <Title>Parsing Techniques</Title>
    <Authors>
    </Authors>
  </Book>
</Books>

continued on next slide

Follow these steps (cont.):

<Books>
  <Book>
    <Title>Parsing Techniques</Title>
    <Authors>
      <Author>Dick Grune</Author>
    </Authors>
  </Book>
</Books>

<Books>
  <Book>
    <Title>Parsing Techniques</Title>
    <Authors>
      <Author>Dick Grune</Author>
      <Author>Ceriel J. H. Jacobs</Author>
    </Authors>
    <Date>2007</Date>
  </Book>
</Books>

continued on next slide

Follow these steps (cont.):

<Books>
  <Book>
    <Title>Parsing Techniques</Title>
    <Authors>
      <Author>Dick Grune</Author>
      <Author>Ceriel J. H. Jacobs</Author>
    </Authors>
    <Date>2007</Date>
    <ISBN>978-0-387-20248-8</ISBN>
  </Book>
</Books>

<Books>
  <Book>
    <Title>Parsing Techniques</Title>
    <Authors>
      <Author>Dick Grune</Author>
      <Author>Ceriel J. H. Jacobs</Author>
    </Authors>
    <Date>2007</Date>
    <ISBN>978-0-387-20248-8</ISBN>
    <Publisher>Springer</Publisher>
  </Book>
</Books>

and so forth, filling in the second Book, then the third Book

Last step: add the last Book’s Publisher

Before the last step:

<Books>
  <Book>
    <Title>Parsing Techniques</Title>
    <Authors>
      <Author>Dick Grune</Author>
      <Author>Ceriel J. H. Jacobs</Author>
    </Authors>
    <Date>2007</Date>
    <ISBN>978-0-387-20248-8</ISBN>
    <Publisher>Springer</Publisher>
  </Book>
  <Book>
    <Title>Introduction to Graph Theory</Title>
    <Authors>
      <Author>Richard J. Trudeau</Author>
    </Authors>
    <Date>1993</Date>
    <ISBN>0-486-67870-9</ISBN>
    <Publisher>Dover Publications</Publisher>
  </Book>
  <Book>
    <Title>Introduction to Formal Languages</Title>
    <Authors>
      <Author>Gyorgy E. Revesz</Author>
    </Authors>
    <Date>2012</Date>
    <ISBN>0-486-66697-2</ISBN>
  </Book>
</Books>

After the last step:

<Books>
  <Book>
    <Title>Parsing Techniques</Title>
    <Authors>
      <Author>Dick Grune</Author>
      <Author>Ceriel J. H. Jacobs</Author>
    </Authors>
    <Date>2007</Date>
    <ISBN>978-0-387-20248-8</ISBN>
    <Publisher>Springer</Publisher>
  </Book>
  <Book>
    <Title>Introduction to Graph Theory</Title>
    <Authors>
      <Author>Richard J. Trudeau</Author>
    </Authors>
    <Date>1993</Date>
    <ISBN>0-486-67870-9</ISBN>
    <Publisher>Dover Publications</Publisher>
  </Book>
  <Book>
    <Title>Introduction to Formal Languages</Title>
    <Authors>
      <Author>Gyorgy E. Revesz</Author>
    </Authors>
    <Date>2012</Date>
    <ISBN>0-486-66697-2</ISBN>
    <Publisher>Dover Publications</Publisher>
  </Book>
</Books>

The last step adds the third Book’s <Publisher>Dover Publications</Publisher>.

Alternate view of the steps (a tree view)
[Figure: tree diagrams of the first production steps; nodes shown: Books, Book, Title, Authors, Author. Continued on next slide.]

Alternate view (cont.)
[Figure: tree diagrams of the next production steps; nodes shown: Books, Book, Title, Authors, Author, Date, ISBN, Publisher. Continued on next slide.]

Alternate view (cont.)
[Figure: tree with Books at the root, one Book below it, and Title, Authors (with Author), Date, ISBN, Publisher below Book.]
and so forth, filling in the second Book, then the third Book

Last step: add the last Book’s Publisher
[Figure: the Books tree with three Book subtrees, before and after the last step; the last step adds the Publisher node of the last Book.]

Terminology: Production Step

<Books>
</Books>

<Books>
  <Book>
    <Title>Parsing Techniques</Title>
  </Book>
</Books>

<Books>
  <Book>
    <Title>Parsing Techniques</Title>
    <Authors>
    </Authors>
  </Book>
</Books>

Each step is called a production step.

Top down
The previous slides showed the generation of the structured XML by starting from the top (root element) and working down to the bottom (leaf nodes).
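The top-down production steps can be imitated in a few lines of Python. This is a minimal sketch under stated assumptions, not the author's implementation: `produce_book` is a hypothetical helper, and the granularity of the steps (one snapshot per added element) is chosen for illustration and does not match the slides step for step.

```python
import xml.etree.ElementTree as ET


def produce_book(books, title, authors, date, isbn, publisher):
    """Add one Book under the root, top-down, recording a snapshot per step."""
    snapshots = []

    def snap():
        # Serialize the whole document as it looks after this step.
        snapshots.append(ET.tostring(books, encoding="unicode"))

    book = ET.SubElement(books, "Book")
    ET.SubElement(book, "Title").text = title
    snap()
    authors_el = ET.SubElement(book, "Authors")
    snap()
    for name in authors:
        ET.SubElement(authors_el, "Author").text = name
        snap()
    for tag, value in (("Date", date), ("ISBN", isbn), ("Publisher", publisher)):
        ET.SubElement(book, tag).text = value
        snap()
    return snapshots


books = ET.Element("Books")  # the starting rule: the root element
steps = produce_book(
    books,
    "Parsing Techniques",
    ["Dick Grune", "Ceriel J. H. Jacobs"],
    "2007",
    "978-0-387-20248-8",
    "Springer",
)
for step in steps:
    print(step)
```

Each printed line is the document after one production step; the Publisher appears only in the final snapshot, which is the step a bottom-up parser would undo first.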

Bottom-up parsing
In bottom-up parsing we work backward: from the last step to the first step.

Let’s begin …
• One production step must have been the last, and its result must be visible in the linear representation.
• We recognize the rule Publisher → text in the last element, <Publisher>Dover Publications</Publisher>. This gives us the final step in the production process (and the first step in bottom-up parsing):

  <Title>Parsing Techniques</Title>
  <Author>Dick Grune</Author>
  <Author>Ceriel J. H. Jacobs</Author>
  <Date>2007</Date>
  <ISBN>978-0-387-20248-8</ISBN>
  <Publisher>Springer</Publisher>
  <Title>Introduction to Graph Theory</Title>
  <Author>Richard J. Trudeau</Author>
  <Date>1993</Date>
  <ISBN>0-486-67870-9</ISBN>
  <Publisher>Dover Publications</Publisher>
  <Title>Introduction to Formal Languages</Title>
  <Author>Gyorgy E. Revesz</Author>
  <Date>2012</Date>
  <ISBN>0-486-66697-2</ISBN>
  <Publisher>Dover Publications</Publisher>

Next
We recognize the rule ISBN → text in the last ISBN element, <ISBN>0-486-66697-2</ISBN>. This gives us the next-to-last step in the production process (and the second step in bottom-up parsing):

  <Title>Parsing Techniques</Title>
  <Author>Dick Grune</Author>
  <Author>Ceriel J. H. Jacobs</Author>
  <Date>2007</Date>
  <ISBN>978-0-387-20248-8</ISBN>
  <Publisher>Springer</Publisher>
  <Title>Introduction to Graph Theory</Title>
  <Author>Richard J. Trudeau</Author>
  <Date>1993</Date>
  <ISBN>0-486-67870-9</ISBN>
  <Publisher>Dover Publications</Publisher>
  <Title>Introduction to Formal Languages</Title>
  <Author>Gyorgy E. Revesz</Author>
  <Date>2012</Date>
  <ISBN>0-486-66697-2</ISBN>
  <Publisher>Dover Publications</Publisher>

Next
We recognize the rule Date → text in the last Date element, <Date>2012</Date>. This gives us the third step in bottom-up parsing:

  <Title>Parsing Techniques</Title>
  <Author>Dick Grune</Author>
  <Author>Ceriel J. H. Jacobs</Author>
  <Date>2007</Date>
  <ISBN>978-0-387-20248-8</ISBN>
  <Publisher>Springer</Publisher>
  <Title>Introduction to Graph Theory</Title>
  <Author>Richard J. Trudeau</Author>
  <Date>1993</Date>
  <ISBN>0-486-67870-9</ISBN>
  <Publisher>Dover Publications</Publisher>
  <Title>Introduction to Formal Languages</Title>
  <Author>Gyorgy E. Revesz</Author>
  <Date>2012</Date>
  <ISBN>0-486-66697-2</ISBN>
  <Publisher>Dover Publications</Publisher>

Next
We recognize the rule Author → text in the last Author element, <Author>Gyorgy E. Revesz</Author>. This gives us the fourth step in bottom-up parsing:

  <Title>Parsing Techniques</Title>
  <Author>Dick Grune</Author>
  <Author>Ceriel J. H. Jacobs</Author>
  <Date>2007</Date>
  <ISBN>978-0-387-20248-8</ISBN>
  <Publisher>Springer</Publisher>
  <Title>Introduction to Graph Theory</Title>
  <Author>Richard J. Trudeau</Author>
  <Date>1993</Date>
  <ISBN>0-486-67870-9</ISBN>
  <Publisher>Dover Publications</Publisher>
  <Title>Introduction to Formal Languages</Title>
  <Author>Gyorgy E. Revesz</Author>
  <Date>2012</Date>
  <ISBN>0-486-66697-2</ISBN>
  <Publisher>Dover Publications</Publisher>

Next
We recognize the rule Authors → Author+ in the last book’s run of Author elements and wrap it in an Authors element. This gives us the fifth step in bottom-up parsing:

  <Title>Parsing Techniques</Title>
  <Author>Dick Grune</Author>
  <Author>Ceriel J. H. Jacobs</Author>
  <Date>2007</Date>
  <ISBN>978-0-387-20248-8</ISBN>
  <Publisher>Springer</Publisher>
  <Title>Introduction to Graph Theory</Title>
  <Author>Richard J. Trudeau</Author>
  <Date>1993</Date>
  <ISBN>0-486-67870-9</ISBN>
  <Publisher>Dover Publications</Publisher>
  <Title>Introduction to Formal Languages</Title>
  <Authors>
    <Author>Gyorgy E. Revesz</Author>
  </Authors>
  <Date>2012</Date>
  <ISBN>0-486-66697-2</ISBN>
  <Publisher>Dover Publications</Publisher>

Next
We recognize the rule Title → text in the last Title element, <Title>Introduction to Formal Languages</Title>. This gives us the sixth step in bottom-up parsing:

  <Title>Parsing Techniques</Title>
  <Author>Dick Grune</Author>
  <Author>Ceriel J. H. Jacobs</Author>
  <Date>2007</Date>
  <ISBN>978-0-387-20248-8</ISBN>
  <Publisher>Springer</Publisher>
  <Title>Introduction to Graph Theory</Title>
  <Author>Richard J. Trudeau</Author>
  <Date>1993</Date>
  <ISBN>0-486-67870-9</ISBN>
  <Publisher>Dover Publications</Publisher>
  <Title>Introduction to Formal Languages</Title>
  <Authors>
    <Author>Gyorgy E. Revesz</Author>
  </Authors>
  <Date>2012</Date>
  <ISBN>0-486-66697-2</ISBN>
  <Publisher>Dover Publications</Publisher>

Next
We recognize the rule Book → Title Authors Date ISBN Publisher in the last five items and wrap them in a Book element. This gives us the seventh step in bottom-up parsing:

  <Title>Parsing Techniques</Title>
  <Author>Dick Grune</Author>
  <Author>Ceriel J. H. Jacobs</Author>
  <Date>2007</Date>
  <ISBN>978-0-387-20248-8</ISBN>
  <Publisher>Springer</Publisher>
  <Title>Introduction to Graph Theory</Title>
  <Author>Richard J. Trudeau</Author>
  <Date>1993</Date>
  <ISBN>0-486-67870-9</ISBN>
  <Publisher>Dover Publications</Publisher>
  <Book>
    <Title>Introduction to Formal Languages</Title>
    <Authors>
      <Author>Gyorgy E. Revesz</Author>
    </Authors>
    <Date>2012</Date>
    <ISBN>0-486-66697-2</ISBN>
    <Publisher>Dover Publications</Publisher>
  </Book>

See the algorithm?
See how we are working backwards, from the bottom grammar rules up to the starting grammar rule? In the process we are adding structure to the flat (linear) XML. Neat!
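The walk-through above can be sketched in Python. This is an illustrative sketch written for these notes, not the author's XSLT program: the names `FLAT`, `take`, and `parse` are assumptions, and the code is hard-wired to the Books grammar rather than driven by an arbitrary grammar. It scans the flat list from the right, recognizing Publisher, ISBN, Date, a run of Author elements, and Title, and reduces each group to a Book.

```python
import xml.etree.ElementTree as ET

FLAT = """<Books>
  <Title>Parsing Techniques</Title>
  <Author>Dick Grune</Author>
  <Author>Ceriel J. H. Jacobs</Author>
  <Date>2007</Date>
  <ISBN>978-0-387-20248-8</ISBN>
  <Publisher>Springer</Publisher>
  <Title>Introduction to Graph Theory</Title>
  <Author>Richard J. Trudeau</Author>
  <Date>1993</Date>
  <ISBN>0-486-67870-9</ISBN>
  <Publisher>Dover Publications</Publisher>
  <Title>Introduction to Formal Languages</Title>
  <Author>Gyorgy E. Revesz</Author>
  <Date>2012</Date>
  <ISBN>0-486-66697-2</ISBN>
  <Publisher>Dover Publications</Publisher>
</Books>"""


def take(items, pos, tag):
    """Recognize the rule `tag -> text` in the item just left of pos."""
    if pos == 0 or items[pos - 1].tag != tag or len(items[pos - 1]) != 0:
        raise ValueError(f"expected <{tag}> before position {pos}")
    return items[pos - 1]


def parse(flat_xml):
    """Reduce a flat Books document to a structured one, right to left."""
    items = list(ET.fromstring(flat_xml))
    pos = len(items)
    books = []
    while pos > 0:
        publisher = take(items, pos, "Publisher"); pos -= 1
        isbn = take(items, pos, "ISBN"); pos -= 1
        date = take(items, pos, "Date"); pos -= 1
        # Authors -> Author+ : collect the whole run of adjacent Authors.
        run = [take(items, pos, "Author")]; pos -= 1
        while pos > 0 and items[pos - 1].tag == "Author":
            run.append(take(items, pos, "Author")); pos -= 1
        authors = ET.Element("Authors")
        authors.extend(run[::-1])  # restore document order
        title = take(items, pos, "Title"); pos -= 1
        # Book -> Title Authors Date ISBN Publisher
        book = ET.Element("Book")
        book.extend([title, authors, date, isbn, publisher])
        books.append(book)
    if not books:
        raise ValueError("Books -> Book+ needs at least one Book")
    root = ET.Element("Books")  # Books -> Book+
    root.extend(books[::-1])
    return root


structured = parse(FLAT)
for book in structured:
    print(book.find("Title").text, len(book.find("Authors")))
```

Scanning from the right mirrors the order of the slides, where the last production step (the last Publisher) is the first thing recognized. Input that does not fit the grammar raises `ValueError` instead of being silently restructured.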

Terminology: Reduction
• In bottom-up parsing a collection of symbols is recognized as derived from a symbol. For example, Title, Authors, Date, ISBN, Publisher is derived from Book:

  Book → Title Authors Date ISBN Publisher

• Title, Authors, Date, ISBN, Publisher is reduced to Book.
• So the bottom-up parsing process is a reduction process.
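A single reduction can also be shown on bare symbol names, without any XML. The sketch below is an assumption-laden illustration: `reduce_rightmost` is a hypothetical helper that replaces the rightmost occurrence of a rule's right-hand side with its left-hand side.

```python
def reduce_rightmost(symbols, lhs, rhs):
    """Replace the rightmost occurrence of rhs in symbols with lhs.

    Returns a new list, or None if rhs does not occur in symbols.
    """
    n = len(rhs)
    for start in range(len(symbols) - n, -1, -1):
        if symbols[start:start + n] == rhs:
            return symbols[:start] + [lhs] + symbols[start + n:]
    return None


BOOK_RHS = ["Title", "Authors", "Date", "ISBN", "Publisher"]

# One Book is already reduced; the second is still a flat run of symbols.
before = ["Book", "Title", "Authors", "Date", "ISBN", "Publisher"]
after = reduce_rightmost(before, "Book", BOOK_RHS)
print(after)
```

Here the five trailing symbols are reduced to a single Book, leaving two Book symbols that the rule Books → Book+ could reduce in turn.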

Build your own bottom-up parser!
You now have enough knowledge that you can go off and build your own bottom-up parser.

I implemented a bottom-up parser
• I used XSLT to implement a bottom-up parser.
• If you would like to give my implementation a go, here is the XSLT program and a sample flat (linear) input XML document:
  • http://www.xfront.com/parsing-techniques/bottom-up-parser/bottom-upparser-for-Books.xsl
  • http://www.xfront.com/parsing-techniques/bottom-up-parser/Books.xml