Recursive Descent Parsing for XML Developers Roger L

Table of Contents • Introduction to parsing in general, recursive descent parsing in particular

Flat XML Document You might receive an XML document that has no structure. For

Give it structure to facilitate processing <input> <Title>Parsing Techniques</Title> <Author>Dick Grune</Author> <Author> Ceriel J.

That’s parsing! Parsing is taking a flat (linear) sequence of items and adding structure

Parsing <input> <Title>Parsing Techniques</Title> <Author>Dick Grune</Author> <Author> Ceriel J. H. Jacobs</Author> <Date>2007</Date> <ISBN>978 -0

From the book: “Parsing Techniques” • Parsing is the process of structuring a linear

Grammar • A grammar is a succinct description of the structure. • Here is

Parsing Linear representation <input> <Title>Parsing Techniques</Title> <Author>Dick Grune</Author> <Author> Ceriel J. H. Jacobs</Author> <Date>2007</Date>

Alternate view of the parser output Grammar Books → Book+ Book → Title Authors

Parsing Techniques • Over the last 50 years many parsing techniques have been created.

Terminology: Token • A token is an atomic (indivisible) unit. • Each item in

The input consists of a sequence of tokens Each of these are tokens. This

After parsing the tokens will be leaf nodes <Books> <Book> <Title>Parsing Techniques</Title> <Authors> <Author>Dick

Another view of the tokens, after parsing Books Book Title Parsing Techniques Author Dick

Parsing structures the input by wrapping the tokens in non-terminal symbols <Books> <Book> <Title>Parsing

Recursive descent parsing works like this: • Start at the grammar’s start symbol and

Initial Books → Book+ Book → Title Authors Date ISBN Publisher Authors → Author+

Output the start symbol Books → Book+ Book → Title Authors Date ISBN Publisher

Grammar says there must be at least one Books → Book+ Book → Title

Output <Book> Books → Book+ Book → Title Authors Date ISBN Publisher Authors →

Grammar says the token in the input stream must be Title Books → Book+

Grammar: after Title must be Authors Books → Book+ Book → Title Authors Date

Output <Authors> Books → Book+ Book → Title Authors Date ISBN Publisher Authors →

Grammar says the next token in the input. Yea, the input token matches the

Grammar says the next token in the input Another Author stream may be an

The next token in the input stream is not an Author token Books →

Grammar says the input stream token must be a Date token Books → Book+

Grammar says the input stream token must be an ISBN token Books → Book+

Grammar says the input stream token must be a Publisher token Books → Book+

We’ve completed structuring the first 6 input tokens Books → Book+ Book → Title

Completed the Book rule Books → Book+ Book → Title Authors Date ISBN Publisher

Begin work on structuring the next Books → Book+ Book → Title Authors Date

Implementation The following slides show, in a step-by-step manner, how to implement a recursive

Step 1 Create a function for each non-terminal symbol in the grammar: Functions Books()

Step 2 Create a global element, Token, that is used to identify the current

Step 3 Create a function, get_next_token(). When it is called, it increments the current

Step 4 Create a function, token(), and pass it a name, tk. The purpose

Example of using the token() function Suppose that during recursive descent parsing the grammar

Example (cont. ) The token() function determines that there is a match, so it

The token() function token(string tk) { if (tk != input[position() = Token]) then return

Motivation for Step 5 Suppose that during recursive descent parsing we are in the

Step 5 Create a function, require(), and pass it a token, found. If the

Step 6 Create an error function, error(). Pass it a string. It outputs the

The complete implementation • Recursive descent has been around a long time and people

$Token = 0 main() { get_next_token() require(input()) } Code for a Recursive Descent Parser$

$optional_additional_Authors() { author = token(‘Author') if exists(author) then return (author, optional_additional_Authors()) } token(string tk)$

XSLT Implementation • I created an XSLT implementation. I tried to mirror the beautiful

Richer example • The Books example shown on the previous slides was fine for

Expressions • Let’s parse a simple expression language that has these tokens: IDENTIFIER, addition,

Expression grammar input → expression Eo. F expression → term rest_expression term → IDENTIFIER

Parse tree for: IDENTIFIER Eo. F input expression Eo. F term rest_expression IDENTIFIER ε

Parser selects the first alternative input expression Eo. F term rest_expression IDENTIFIER ε term

Parse tree for: (IDENTIFIER) Eo. F input expression Eo. F term rest_expression parenthesized_expression ε

Parser selects the second alternative input expression Eo. F term rest_expression parenthesized_expression ε (

Question How does a recursive descent parser know that it should select the first

Answer • The parser doesn’t know. • It tries the first alternative. If that

Processing the first token in the input stream 1 2 input expression Input tokens:

Implementation of the term() function term() { <term> identifier = token('IDENTIFIER') if exists(identifier) then

term() function (cont. ) term() { <term> identifier = token('IDENTIFIER') if exists(identifier) then return

Let’s represent each expression as XML Instead of this input: IDENTIFIER Eo. F our

XML representation (cont. ) Instead of this input: (IDENTIFIER) Eo. F our input will

XML representation (cont. ) Instead of this input: IDENTIFIER + IDENTIFIER Eo. F our

XML representation (cont. ) Instead of this input: IDENTIFIER + (IDENTIFIER + IDENTIFIER) Eo.

Parsing input → expression Eo. F expression → term rest_expression term → IDENTIFIER |

Parsing (cont. ) input → expression Eo. F expression → term rest_expression term →

Parsing (cont. ) <input> <IDENTIFIER /> <PLUS /> <LP /> <IDENTIFIER /> <PLUS />

XSLT Implementation • I created an XSLT implementation of a recursive descent parser for

Limitations of Recursive Descent Parsers Recall that in a rule containing alternatives we tried

Limitations (cont. ) Recursive descent parsers can’t handle left-recursive grammar rules. The parser goes

Limitations (cont. ) Suppose we add an array element as a term: term →

References – Great Books Modern Compiler Design (http: //www. amazon. com/Modern-Compiler -Design-Dick. Grune/dp/1461446988/ref=sr_1_1? s=books&ie=UTF

Slides: 74

Download presentation

Recursive Descent Parsing for XML Developers Roger L. Costello 15 October 20141

Table of Contents • Introduction to parsing in general, recursive descent parsing in particular • Example #1: How to do recursive descent parsing on Book data • Example #2: How to do recursive descent parsing for a grammar that contains alternatives • Limitations of recursive descent parsing 2

Flat XML Document You might receive an XML document that has no structure. For example, this XML document contains a flat (linear) list of Book data: <input> <Title>Parsing Techniques</Title> <Author>Dick Grune</Author> <Author> Ceriel J. H. Jacobs</Author> <Date>2007</Date> <ISBN>978 -0 -387 -20248 -8</ISBN> <Publisher>Springer</Publisher> <Title>Introduction to Graph Theory</Title> <Author>Richard J. Trudeau</Author> <Date>1993</Date> <ISBN>0 -486 -67870 -9</ISBN> <Publisher>Dover Publications</Publisher> <Title>Introduction to Formal Languages</Title> <Author>Gyorgy E. Revesz</Author> <Date>2012</Date> <ISBN>0 -486 -66697 -2</ISBN> <Publisher>Dover Publications</Publisher> </input> 2

Give it structure to facilitate processing <input> <Title>Parsing Techniques</Title> <Author>Dick Grune</Author> <Author> Ceriel J. H. Jacobs</Author> <Date>2007</Date> <ISBN>978 -0 -387 -20248 -8</ISBN> <Publisher>Springer</Publisher> <Title>Introduction to Graph Theory</Title> <Author>Richard J. Trudeau</Author> <Date>1993</Date> <ISBN>0 -486 -67870 -9</ISBN> <Publisher>Dover Publications</Publisher> <Title>Introduction to Formal Languages</Title> <Author>Gyorgy E. Revesz</Author> <Date>2012</Date> <ISBN>0 -486 -66697 -2</ISBN> <Publisher>Dover Publications</Publisher> </input> <Books> <Book> <Title>Parsing Techniques</Title> <Authors> <Author>Dick Grune</Author> <Author> Ceriel J. H. Jacobs</Author> </Authors> <Date>2007</Date> <ISBN>978 -0 -387 -20248 -8</ISBN> <Publisher>Springer</Publisher> </Book> <Title>Introduction to Graph Theory</Title> <Authors> <Author>Richard J. Trudeau</Author> </Authors> <Date>1993</Date> <ISBN>0 -486 -67870 -9</ISBN> <Publisher>Dover Publications</Publisher> </Book> <Title>Introduction to Formal Languages</Title> <Authors> <Author>Gyorgy E. Revesz</Author> </Authors> <Date>2012</Date> <ISBN>0 -486 -66697 -2</ISBN> <Publisher>Dover Publications</Publisher> </Books> 3

That’s parsing! Parsing is taking a flat (linear) sequence of items and adding structure so that the result conforms to a grammar. 4

Parsing <input> <Title>Parsing Techniques</Title> <Author>Dick Grune</Author> <Author> Ceriel J. H. Jacobs</Author> <Date>2007</Date> <ISBN>978 -0 -387 -20248 -8</ISBN> <Publisher>Springer</Publisher> <Title>Introduction to Graph Theory</Title> <Author>Richard J. Trudeau</Author> <Date>1993</Date> <ISBN>0 -486 -67870 -9</ISBN> <Publisher>Dover Publications</Publisher> <Title>Introduction to Formal Languages</Title> <Author>Gyorgy E. Revesz</Author> <Date>2012</Date> <ISBN>0 -486 -66697 -2</ISBN> <Publisher>Dover Publications</Publisher> </input> parse <Books> <Book> <Title>Parsing Techniques</Title> <Authors> <Author>Dick Grune</Author> <Author> Ceriel J. H. Jacobs</Author> </Authors> <Date>2007</Date> <ISBN>978 -0 -387 -20248 -8</ISBN> <Publisher>Springer</Publisher> </Book> <Title>Introduction to Graph Theory</Title> <Authors> <Author>Richard J. Trudeau</Author> </Authors> <Date>1993</Date> <ISBN>0 -486 -67870 -9</ISBN> <Publisher>Dover Publications</Publisher> </Book> <Title>Introduction to Formal Languages</Title> <Authors> <Author>Gyorgy E. Revesz</Author> </Authors> <Date>2012</Date> <ISBN>0 -486 -66697 -2</ISBN> <Publisher>Dover Publications</Publisher> </Books> 5

From the book: “Parsing Techniques” • Parsing is the process of structuring a linear representation in accordance with a given grammar. • The “linear representation” may be: • • a flat sequence of XML elements a sentence a computer program a knitting pattern a sequence of geological strata a piece of music actions of ritual behavior 7

Grammar • A grammar is a succinct description of the structure. • Here is a grammar for Books: Books → Book+ Book → Title Authors Date ISBN Publisher Authors → Author+ Title → text Author → text Date → text ISBN → text Publisher → text 7

Parsing Linear representation <input> <Title>Parsing Techniques</Title> <Author>Dick Grune</Author> <Author> Ceriel J. H. Jacobs</Author> <Date>2007</Date> <ISBN>978 -0 -387 -20248 -8</ISBN> <Publisher>Springer</Publisher> <Title>Introduction to Graph Theory</Title> <Author>Richard J. Trudeau</Author> <Date>1993</Date> <ISBN>0 -486 -67870 -9</ISBN> <Publisher>Dover Publications</Publisher> <Title>Introduction to Formal Languages</Title> <Author>Gyorgy E. Revesz</Author> <Date>2012</Date> <ISBN>0 -486 -66697 -2</ISBN> <Publisher>Dover Publications</Publisher> </input> Grammar Books → Book+ Book → Title Authors Date ISBN Publisher Authors → Author+ Title → text Author → text Date → text ISBN → text Publisher → text Structured representation parser <Books> <Book> <Title>Parsing Techniques</Title> <Authors> <Author>Dick Grune</Author> <Author> Ceriel J. H. Jacobs</Author> </Authors> <Date>2007</Date> <ISBN>978 -0 -387 -20248 -8</ISBN> <Publisher>Springer</Publisher> </Book> <Title>Introduction to Graph Theory</Title> <Authors> <Author>Richard J. Trudeau</Author> </Authors> <Date>1993</Date> <ISBN>0 -486 -67870 -9</ISBN> <Publisher>Dover Publications</Publisher> </Book> <Title>Introduction to Formal Languages</Title> <Authors> <Author>Gyorgy E. Revesz</Author> </Authors> <Date>2012</Date> <ISBN>0 -486 -66697 -2</ISBN> <Publisher>Dover Publications</Publisher> </Books> 8

Alternate view of the parser output Grammar Books → Book+ Book → Title Authors Date ISBN Publisher Authors → Author+ Title → text Author → text Date → text ISBN → text Publisher → text Linear representation <input> <Title>Parsing Techniques</Title> <Author>Dick Grune</Author> <Author> Ceriel J. H. Jacobs</Author> <Date>2007</Date> <ISBN>978 -0 -387 -20248 -8</ISBN> <Publisher>Springer</Publisher> <Title>Introduction to Graph Theory</Title> <Author>Richard J. Trudeau</Author> <Date>1993</Date> <ISBN>0 -486 -67870 -9</ISBN> <Publisher>Dover Publications</Publisher> <Title>Introduction to Formal Languages</Title> <Author>Gyorgy E. Revesz</Author> <Date>2012</Date> <ISBN>0 -486 -66697 -2</ISBN> <Publisher>Dover Publications</Publisher> </input> parser Books Book Title Parsing Techniques Author Dick Grune Date 2007 Ceriel J. H. Jacobs Parse tree Book ISBN Publisher 978 -0 -387 -20248 -8 Springer Title Introduction to Graph Theory Authors Author Richard J. Trudeau Date 1993 ISBN Publisher Title Authors 0 -486 -67870 -9 Dover Publications Introduction to Formal Languages Date 2012 Author Gyorgy E. Revesz ISBN Publisher 0 -486 -66697 -2 Dover Publications 8

Parsing Techniques • Over the last 50 years many parsing techniques have been created. • Some parsing techniques work from the starting grammar rule to the bottom. Those are called top-down parsing techniques. • Other parsing techniques work from the bottom grammar rules to the starting grammar rule. Those are called bottom-up parsing techniques. • The following slides explain the “recursive descent parsing technique. ” It is a top-down parsing technique. 9

Terminology: Token • A token is an atomic (indivisible) unit. • Each item in the input is a token. • After parsing the tokens will be leaf nodes. 12

The input consists of a sequence of tokens Each of these are tokens. This input consists of 16 tokens. <input> <Title>Parsing Techniques</Title> <Author>Dick Grune</Author> <Author> Ceriel J. H. Jacobs</Author> <Date>2007</Date> <ISBN>978 -0 -387 -20248 -8</ISBN> <Publisher>Springer</Publisher> <Title>Introduction to Graph Theory</Title> <Author>Richard J. Trudeau</Author> <Date>1993</Date> <ISBN>0 -486 -67870 -9</ISBN> <Publisher>Dover Publications</Publisher> <Title>Introduction to Formal Languages</Title> <Author>Gyorgy E. Revesz</Author> <Date>2012</Date> <ISBN>0 -486 -66697 -2</ISBN> <Publisher>Dover Publications</Publisher> </input> 13

After parsing the tokens will be leaf nodes <Books> <Book> <Title>Parsing Techniques</Title> <Authors> <Author>Dick Grune</Author> <Author> Ceriel J. H. Jacobs</Author> </Authors> <Date>2007</Date> <ISBN>978 -0 -387 -20248 -8</ISBN> <Publisher>Springer</Publisher> </Book> <Title>Introduction to Graph Theory</Title> <Authors> <Author>Richard J. Trudeau</Author> </Authors> <Date>1993</Date> <ISBN>0 -486 -67870 -9</ISBN> <Publisher>Dover Publications</Publisher> </Book> <Title>Introduction to Formal Languages</Title> <Authors> <Author>Gyorgy E. Revesz</Author> </Authors> <Date>2012</Date> <ISBN>0 -486 -66697 -2</ISBN> <Publisher>Dover Publications</Publisher> </Books> tokens (terminal symbols) 14

Another view of the tokens, after parsing Books Book Title Parsing Techniques Author Dick Grune Date 2007 Ceriel J. H. Jacobs Book ISBN Publisher 978 -0 -387 -20248 -8 Springer Title Introduction to Graph Theory Authors Author Richard J. Trudeau Date 1993 ISBN Publisher Title Authors 0 -486 -67870 -9 Dover Publications Introduction to Formal Languages Date 2012 ISBN Publisher 0 -486 -66697 -2 Dover Publications Author Gyorgy E. Revesz 15

Parsing structures the input by wrapping the tokens in non-terminal symbols <Books> <Book> <Title>Parsing Techniques</Title> <Authors> <Author>Dick Grune</Author> <Author> Ceriel J. H. Jacobs</Author> </Authors> <Date>2007</Date> <ISBN>978 -0 -387 -20248 -8</ISBN> <Publisher>Springer</Publisher> </Book> <Title>Introduction to Graph Theory</Title> <Authors> <Author>Richard J. Trudeau</Author> </Authors> <Date>1993</Date> <ISBN>0 -486 -67870 -9</ISBN> <Publisher>Dover Publications</Publisher> </Book> <Title>Introduction to Formal Languages</Title> <Authors> <Author>Gyorgy E. Revesz</Author> </Authors> <Date>2012</Date> <ISBN>0 -486 -66697 -2</ISBN> <Publisher>Dover Publications</Publisher> </Books> 16

Recursive descent parsing works like this: • Start at the grammar’s start symbol and output it. In our grammar, the start symbol is <Books>, so output it. • Progress through each grammar rule. For a non-terminal symbol, output it. For a terminal symbol (i. e. , token), check the token in the input stream for match with the terminal symbol; if it matches, output it. 17

Initial Books → Book+ Book → Title Authors Date ISBN Publisher Authors → Author+ Title → text Author → text Date → text ISBN → text Publisher → text <Title>Parsing Techniques</Title> <Author>Dick Grune</Author> <Author> Ceriel J. H. Jacobs</Author> <Date>2007</Date> <ISBN>978 -0 -387 -20248 -8</ISBN> <Publisher>Springer</Publisher> <Title>Introduction to Graph Theory</Title> <Author>Richard J. Trudeau</Author> <Date>1993</Date> <ISBN>0 -486 -67870 -9</ISBN> <Publisher>Dover Publications</Publisher> <Title>Introduction to Formal Languages</Title> <Author>Gyorgy E. Revesz</Author> <Date>2012</Date> <ISBN>0 -486 -66697 -2</ISBN> <Publisher>Dover Publications</Publisher> Start with the grammar’s start symbol and the first token in the input stream. 7

Output the start symbol Books → Book+ Book → Title Authors Date ISBN Publisher Authors → Author+ Title → text Author → text Date → text ISBN → text Publisher → text Output: <Books> <Title>Parsing Techniques</Title> <Author>Dick Grune</Author> <Author> Ceriel J. H. Jacobs</Author> <Date>2007</Date> <ISBN>978 -0 -387 -20248 -8</ISBN> <Publisher>Springer</Publisher> <Title>Introduction to Graph Theory</Title> <Author>Richard J. Trudeau</Author> <Date>1993</Date> <ISBN>0 -486 -67870 -9</ISBN> <Publisher>Dover Publications</Publisher> <Title>Introduction to Formal Languages</Title> <Author>Gyorgy E. Revesz</Author> <Date>2012</Date> <ISBN>0 -486 -66697 -2</ISBN> <Publisher>Dover Publications</Publisher> </Books> 19

Grammar says there must be at least one Books → Book+ Book → Title Authors Date ISBN Publisher Authors → Author+ Title → text Author → text Date → text ISBN → text Publisher → text So the input stream must contain all the tokens for at least one Book. Let’s process the grammar rule for Book. 20

Output <Book> Books → Book+ Book → Title Authors Date ISBN Publisher Authors → Author+ Title → text Author → text Date → text ISBN → text Publisher → text Output: <Books> <Book> </Books> <Title>Parsing Techniques</Title> <Author>Dick Grune</Author> <Author> Ceriel J. H. Jacobs</Author> <Date>2007</Date> <ISBN>978 -0 -387 -20248 -8</ISBN> <Publisher>Springer</Publisher> <Title>Introduction to Graph Theory</Title> <Author>Richard J. Trudeau</Author> <Date>1993</Date> <ISBN>0 -486 -67870 -9</ISBN> <Publisher>Dover Publications</Publisher> <Title>Introduction to Formal Languages</Title> <Author>Gyorgy E. Revesz</Author> <Date>2012</Date> <ISBN>0 -486 -66697 -2</ISBN> <Publisher>Dover Publications</Publisher> 21

Grammar says the token in the input stream must be Title Books → Book+ Book → Title Authors Date ISBN Publisher Authors → Author+ Title → text Author → text Date → text ISBN → text Publisher → text Output: <Books> <Book> <Title>Parsing Techniques</Title> <Book> </Books> Yea, the input token matches the grammar rule <Title>Parsing Techniques</Title> <Author>Dick Grune</Author> <Author> Ceriel J. H. Jacobs</Author> <Date>2007</Date> <ISBN>978 -0 -387 -20248 -8</ISBN> <Publisher>Springer</Publisher> <Title>Introduction to Graph Theory</Title> <Author>Richard J. Trudeau</Author> <Date>1993</Date> <ISBN>0 -486 -67870 -9</ISBN> <Publisher>Dover Publications</Publisher> <Title>Introduction to Formal Languages</Title> <Author>Gyorgy E. Revesz</Author> <Date>2012</Date> <ISBN>0 -486 -66697 -2</ISBN> <Publisher>Dover Publications</Publisher> 22

Grammar: after Title must be Authors Books → Book+ Book → Title Authors Date ISBN Publisher Authors → Author+ Title → text Author → text Date → text ISBN → text Publisher → text So the input stream must contain Author tokens. Let’s process the rule for Authors. 23

Output <Authors> Books → Book+ Book → Title Authors Date ISBN Publisher Authors → Author+ Title → text Author → text Date → text ISBN → text Publisher → text Output: <Books> <Book> <Authors> <Title>Parsing Techniques</Title> <Author>Dick Grune</Author> <Author> Ceriel J. H. Jacobs</Author> <Date>2007</Date> <ISBN>978 -0 -387 -20248 -8</ISBN> <Publisher>Springer</Publisher> <Title>Introduction to Graph Theory</Title> <Author>Richard J. Trudeau</Author> <Date>1993</Date> <ISBN>0 -486 -67870 -9</ISBN> <Publisher>Dover Publications</Publisher> <Title>Introduction to Formal Languages</Title> <Author>Gyorgy E. Revesz</Author> <Date>2012</Date> <ISBN>0 -486 -66697 -2</ISBN> <Publisher>Dover Publications</Publisher> <Authors> <Book> </Books> 24

Grammar says the next token in the input. Yea, the input token matches the stream must be an Author token grammar rule Books → Book+ Book → Title Authors Date ISBN Publisher Authors → Author+ Title → text Author → text Date → text ISBN → text Publisher → text Output: <Books> <Book> <Authors> <Author>Dick Grune</Author> <Title>Parsing Techniques</Title> <Author>Dick Grune</Author> <Author> Ceriel J. H. Jacobs</Author> <Date>2007</Date> <ISBN>978 -0 -387 -20248 -8</ISBN> <Publisher>Springer</Publisher> <Title>Introduction to Graph Theory</Title> <Author>Richard J. Trudeau</Author> <Date>1993</Date> <ISBN>0 -486 -67870 -9</ISBN> <Publisher>Dover Publications</Publisher> <Title>Introduction to Formal Languages</Title> <Author>Gyorgy E. Revesz</Author> <Date>2012</Date> <ISBN>0 -486 -66697 -2</ISBN> <Publisher>Dover Publications</Publisher> <Authors> <Book> </Books> 25

Grammar says the next token in the input Another Author stream may be an Author token match Books → Book+ Book → Title Authors Date ISBN Publisher Authors → Author+ Title → text Author → text Date → text ISBN → text Publisher → text Output: <Books> <Book> <Authors> <Author>Dick Grune</Author> <Author> Ceriel J. H. Jacobs</Author> <Authors> <Book> </Books> <Title>Parsing Techniques</Title> <Author>Dick Grune</Author> <Author> Ceriel J. H. Jacobs</Author> <Date>2007</Date> <ISBN>978 -0 -387 -20248 -8</ISBN> <Publisher>Springer</Publisher> <Title>Introduction to Graph Theory</Title> <Author>Richard J. Trudeau</Author> <Date>1993</Date> <ISBN>0 -486 -67870 -9</ISBN> <Publisher>Dover Publications</Publisher> <Title>Introduction to Formal Languages</Title> <Author>Gyorgy E. Revesz</Author> <Date>2012</Date> <ISBN>0 -486 -66697 -2</ISBN> <Publisher>Dover Publications</Publisher> 26

The next token in the input stream is not an Author token Books → Book+ Book → Title Authors Date ISBN Publisher Authors → Author+ Title → text Author → text Date → text ISBN → text Publisher → text <Title>Parsing Techniques</Title> <Author>Dick Grune</Author> <Author> Ceriel J. H. Jacobs</Author> <Date>2007</Date> <ISBN>978 -0 -387 -20248 -8</ISBN> <Publisher>Springer</Publisher> <Title>Introduction to Graph Theory</Title> <Author>Richard J. Trudeau</Author> <Date>1993</Date> <ISBN>0 -486 -67870 -9</ISBN> <Publisher>Dover Publications</Publisher> <Title>Introduction to Formal Languages</Title> <Author>Gyorgy E. Revesz</Author> <Date>2012</Date> <ISBN>0 -486 -66697 -2</ISBN> <Publisher>Dover Publications</Publisher> So, return to the caller (i. e. , return to the Book rule). 27

Grammar says the input stream token must be a Date token Books → Book+ Book → Title Authors Date ISBN Publisher Authors → Author+ Title → text Author → text Date → text ISBN → text Publisher → text Output: <Books> <Book> <Authors> <Author>Dick Grune</Author> <Author> Ceriel J. H. Jacobs</Author> <Authors> <Date>2007</Date> <Book> </Books> Yea, the input token matches the grammar rule <Title>Parsing Techniques</Title> <Author>Dick Grune</Author> <Author> Ceriel J. H. Jacobs</Author> <Date>2007</Date> <ISBN>978 -0 -387 -20248 -8</ISBN> <Publisher>Springer</Publisher> <Title>Introduction to Graph Theory</Title> <Author>Richard J. Trudeau</Author> <Date>1993</Date> <ISBN>0 -486 -67870 -9</ISBN> <Publisher>Dover Publications</Publisher> <Title>Introduction to Formal Languages</Title> <Author>Gyorgy E. Revesz</Author> <Date>2012</Date> <ISBN>0 -486 -66697 -2</ISBN> <Publisher>Dover Publications</Publisher> 28

Grammar says the input stream token must be an ISBN token Books → Book+ Book → Title Authors Date ISBN Publisher Authors → Author+ Title → text Author → text Date → text ISBN → text Publisher → text Output: <Books> <Book> <Authors> <Author>Dick Grune</Author> <Author> Ceriel J. H. Jacobs</Author> <Authors> <Date>2007</Date> <ISBN>978 -0 -387 -20248 -8</ISBN> <Book> </Books> Yea, the input token matches the grammar rule <Title>Parsing Techniques</Title> <Author>Dick Grune</Author> <Author> Ceriel J. H. Jacobs</Author> <Date>2007</Date> <ISBN>978 -0 -387 -20248 -8</ISBN> <Publisher>Springer</Publisher> <Title>Introduction to Graph Theory</Title> <Author>Richard J. Trudeau</Author> <Date>1993</Date> <ISBN>0 -486 -67870 -9</ISBN> <Publisher>Dover Publications</Publisher> <Title>Introduction to Formal Languages</Title> <Author>Gyorgy E. Revesz</Author> <Date>2012</Date> <ISBN>0 -486 -66697 -2</ISBN> <Publisher>Dover Publications</Publisher> 29

Grammar says the input stream token must be a Publisher token Books → Book+ Book → Title Authors Date ISBN Publisher Authors → Author+ Title → text Author → text Date → text ISBN → text Publisher → text Output: <Books> <Book> <Authors> <Author>Dick Grune</Author> <Author> Ceriel J. H. Jacobs</Author> <Authors> <Date>2007</Date> <ISBN>978 -0 -387 -20248 -8</ISBN> <Publisher>Springer</Publisher> <Book> </Books> Yea, the input token matches the grammar rule <Title>Parsing Techniques</Title> <Author>Dick Grune</Author> <Author> Ceriel J. H. Jacobs</Author> <Date>2007</Date> <ISBN>978 -0 -387 -20248 -8</ISBN> <Publisher>Springer</Publisher> <Title>Introduction to Graph Theory</Title> <Author>Richard J. Trudeau</Author> <Date>1993</Date> <ISBN>0 -486 -67870 -9</ISBN> <Publisher>Dover Publications</Publisher> <Title>Introduction to Formal Languages</Title> <Author>Gyorgy E. Revesz</Author> <Date>2012</Date> <ISBN>0 -486 -66697 -2</ISBN> <Publisher>Dover Publications</Publisher> 30

We’ve completed structuring the first 6 input tokens Books → Book+ Book → Title Authors Date ISBN Publisher Authors → Author+ Title → text Author → text Date → text ISBN → text Publisher → text Output: <Books> <Book> <Authors> <Author>Dick Grune</Author> <Author> Ceriel J. H. Jacobs</Author> <Authors> <Date>2007</Date> <ISBN>978 -0 -387 -20248 -8</ISBN> <Publisher>Springer</Publisher> <Book> </Books> <Title>Parsing Techniques</Title> <Author>Dick Grune</Author> <Author> Ceriel J. H. Jacobs</Author> <Date>2007</Date> <ISBN>978 -0 -387 -20248 -8</ISBN> <Publisher>Springer</Publisher> <Title>Introduction to Graph Theory</Title> <Author>Richard J. Trudeau</Author> <Date>1993</Date> <ISBN>0 -486 -67870 -9</ISBN> <Publisher>Dover Publications</Publisher> <Title>Introduction to Formal Languages</Title> <Author>Gyorgy E. Revesz</Author> <Date>2012</Date> <ISBN>0 -486 -66697 -2</ISBN> <Publisher>Dover Publications</Publisher> 31

Completed the Book rule Books → Book+ Book → Title Authors Date ISBN Publisher Authors → Author+ Title → text Author → text Date → text ISBN → text Publisher → text <Title>Parsing Techniques</Title> <Author>Dick Grune</Author> <Author> Ceriel J. H. Jacobs</Author> <Date>2007</Date> <ISBN>978 -0 -387 -20248 -8</ISBN> <Publisher>Springer</Publisher> <Title>Introduction to Graph Theory</Title> <Author>Richard J. Trudeau</Author> <Date>1993</Date> <ISBN>0 -486 -67870 -9</ISBN> <Publisher>Dover Publications</Publisher> <Title>Introduction to Formal Languages</Title> <Author>Gyorgy E. Revesz</Author> <Date>2012</Date> <ISBN>0 -486 -66697 -2</ISBN> <Publisher>Dover Publications</Publisher> We’ve finished processing the Book rule, so return to the caller (i. e. , the Books rule). 32

Begin work on structuring the next Books → Book+ Book → Title Authors Date ISBN Publisher Authors → Author+ Title → text Author → text Date → text ISBN → text Publisher → text <Title>Parsing Techniques</Title> <Author>Dick Grune</Author> <Author> Ceriel J. H. Jacobs</Author> <Date>2007</Date> <ISBN>978 -0 -387 -20248 -8</ISBN> <Publisher>Springer</Publisher> <Title>Introduction to Graph Theory</Title> <Author>Richard J. Trudeau</Author> <Date>1993</Date> <ISBN>0 -486 -67870 -9</ISBN> <Publisher>Dover Publications</Publisher> <Title>Introduction to Formal Languages</Title> <Author>Gyorgy E. Revesz</Author> <Date>2012</Date> <ISBN>0 -486 -66697 -2</ISBN> <Publisher>Dover Publications</Publisher> 33

Implementation The following slides show, in a step-by-step manner, how to implement a recursive descent parser 34

Step 1 Create a function for each non-terminal symbol in the grammar: Functions Books() { … } Books → Book+ Book → Title Authors Date ISBN Publisher Authors → Author+ Book() { … } Authors() { … } 35

Step 2 Create a global element, Token, that is used to identify the current position in the input stream. Initialize Token to 0: Token = 0 36

Step 3 Create a function, get_next_token(). When it is called, it increments the current position in the input stream: get_next_token() { Token = Token + 1 } 37

Step 4 Create a function, token(), and pass it a name, tk. The purpose of this function is to answer the question: “Does the token at the current position in the input stream match tk? ” 38

Example of using the token() function Suppose that during recursive descent parsing the grammar indicates that the next token in the input stream must be “Title. ” Suppose the global variable, Token, indicates that we are here in the input stream: <Title>Parsing Techniques</Title> <Author>Dick Grune</Author> <Author> Ceriel J. H. Jacobs</Author> <Date>2007</Date> <ISBN>978 -0 -387 -20248 -8</ISBN> <Publisher>Springer</Publisher> <Title>Introduction to Graph Theory</Title> <Author>Richard J. Trudeau</Author> <Date>1993</Date> <ISBN>0 -486 -67870 -9</ISBN> <Publisher>Dover Publications</Publisher> <Title>Introduction to Formal Languages</Title> <Author>Gyorgy E. Revesz</Author> <Date>2012</Date> <ISBN>0 -486 -66697 -2</ISBN> <Publisher>Dover Publications</Publisher> 39

Example (cont. ) The token() function determines that there is a match, so it calls get_next_token() to increment the position in the input stream and returns the token: <Title>Parsing Techniques</Title> <Author>Dick Grune</Author> <Author> Ceriel J. H. Jacobs</Author> return <Date>2007</Date> <ISBN>978 -0 -387 -20248 -8</ISBN> <Publisher>Springer</Publisher> <Title>Introduction to Graph Theory</Title> <Author>Richard J. Trudeau</Author> <Date>1993</Date> <ISBN>0 -486 -67870 -9</ISBN> <Publisher>Dover Publications</Publisher> <Title>Introduction to Formal Languages</Title> <Author>Gyorgy E. Revesz</Author> <Date>2012</Date> <ISBN>0 -486 -66697 -2</ISBN> <Publisher>Dover Publications</Publisher> 40

The token() function token(string tk) { if (tk != input[position() = Token]) then return () else { get_next_token() return input[position() = Token]) } } Notice that token() returns empty if there is not a match. 41

Motivation for Step 5 Suppose that during recursive descent parsing we are in the Book() function. The Book() function first checks—by calling the token() function—to see if the current position of the input stream contains “Title. ” Suppose it does. Then, according to the grammar, there must be Authors, Date, ISBN, and then Publisher: Book → Title Authors Date ISBN Publisher 42

Step 5 Create a function, require(), and pass it a token, found. If the token is empty (i. e. , the token() function returned empty because there was not a match) then call the error() function. Otherwise, return the token. require(element found) { if empty(found) then error(‘Invalid input’) else return found } 43

Step 6 Create an error function, error(). Pass it a string. It outputs the string and then halts the parser. error(string s) { output s stop } 44

The complete implementation • Recursive descent has been around a long time and people have developed beautiful code for it. • The following two slides collects all the code from the previous slides. I recommend spending some time studying it to appreciate its beauty. 45

$Token = 0 main() { get_next_token() require(input()) } Code for a Recursive Descent Parser$

Token = 0 main() { get_next_token() require(input()) } Code for a Recursive Descent Parser input() { return require(Books()) } Books() { <Books> return (require(Book()), optional_additional_Books()) </Books> } optional_additional_Books() { book = Book() if exists(book) then return (book, optional_additional_Books()) } Book() { title = token('Title') if exists(title) then <Book> return (title, require(Authors(), require(token('Date')), require(token(‘ISBN')), require(token(‘Publisher')) </Book> } Authors() { <Authors> return (require(Author()), optional_additional_Authors()) </Authors> } 46

$optional_additional_Authors() { author = token(‘Author') if exists(author) then return (author, optional_additional_Authors()) } token(string tk)$

optional_additional_Authors() { author = token(‘Author') if exists(author) then return (author, optional_additional_Authors()) } token(string tk) { if (tk != input[position() = Token]) then return () else { get_next_token() return input[position() = Token]) } } require(element found) { if empty(found) then error(‘Invalid input’) else return found } get_next_token() { Token = Token + 1 } 47

XSLT Implementation • I created an XSLT implementation. I tried to mirror the beautiful code shown on the previous slides. • If you would like to give my implementation a go, here is the XSLT program and a sample flat (linear) input XML document: • http: //www. xfront. com/parsing-techniques/recursive-descent-parser/booksparser. xsl • http: //www. xfront. com/parsing-techniques/recursive-descent-parser/bookstest. xml 48

Richer example • The Books example shown on the previous slides was fine for introducing recursive descent parsing. • But it glossed over an important problem: grammar rules with alternatives. • The following example shows how to do recursive descent parsing with a grammar that has alternatives. 49

Expressions • Let’s parse a simple expression language that has these tokens: IDENTIFIER, addition, parentheses, and Eo. F. • Here a few examples of expressions: IDENTIFIER Eo. F (IDENTIFIER) Eo. F IDENTIFIER + IDENTIFIER Eo. F (IDENTIFIER + IDENTIFIER) Eo. F IDENTIFIER + (IDENTIFIER + IDENTIFIER) Eo. F (IDENTIFIER + IDENTIFIER) + IDENTIFIER Eo. F IDENTIFIER + (IDENTIFIER + IDENTIFIER)) Eo. F Each expression ends with an end-of-file (Eo. F) token. 50

Expression grammar input → expression Eo. F expression → term rest_expression term → IDENTIFIER | parenthesized_expression → '(' expression ')' rest_expression → '+' expression | ε 51

Parse tree for: IDENTIFIER Eo. F input expression Eo. F term rest_expression IDENTIFIER ε input → expression Eo. F expression → term rest_expression term → IDENTIFIER | parenthesized_expression → '(' expression ')' rest_expression → '+' expression | ε 52

Parser selects the first alternative input expression Eo. F term rest_expression IDENTIFIER ε term has two alternatives. The parser selected the first alternative. input → expression Eo. F expression → term rest_expression term → IDENTIFIER | parenthesized_expression → '(' expression ')' rest_expression → '+' expression | ε 53

Parse tree for: (IDENTIFIER) Eo. F input expression Eo. F term rest_expression parenthesized_expression ε ( expression ) term rest_expression IDENTIFIER ε input → expression Eo. F expression → term rest_expression term → IDENTIFIER | parenthesized_expression → '(' expression ')' rest_expression → '+' expression | ε 54

Parser selects the second alternative input expression Eo. F term rest_expression parenthesized_expression ε ( expression term IDENTIFIER ) rest_expression ε term’s second alternative is selected input → expression Eo. F expression → term rest_expression term → IDENTIFIER | parenthesized_expression → '(' expression ')' rest_expression → '+' expression | ε 55

Question How does a recursive descent parser know that it should select the first or second alternative? term → IDENTIFIER | parenthesized_expression How does the parser know which alternative to select? 56

Answer • The parser doesn’t know. • It tries the first alternative. If that fails it tries the second alternative (i. e. , the parser backtracks and tries the next alternative). It repeats until it finds an alternative that succeeds. 57

Processing the first token in the input stream 1 2 input expression Input tokens: ( IDENTIFIER ) Eo. F term 3 IDENTIFIER Try the first alternative, which says the input token must be IDENTIFIER. However, the input token is ( so we must back up and try the next alternative input → expression Eo. F expression → term rest_expression term → IDENTIFIER | parenthesized_expression → '(' expression ')' rest_expression → '+' expression | ε 58

Implementation of the term() function term() { <term> identifier = token('IDENTIFIER') if exists(identifier) then return (identifier) else return (require(parenthesized_expression())) </term> } Check the current token in the input stream to see if it is IDENTIFIER. 59

term() function (cont. ) term() { <term> identifier = token('IDENTIFIER') if exists(identifier) then return (identifier) else return (require(parenthesized_expression())) </term> } If there is a match, return the token. 60

Let’s represent each expression as XML Instead of this input: IDENTIFIER Eo. F our input will be this: <input> <IDENTIFIER /> <Eo. F /> </input> 62

XML representation (cont. ) Instead of this input: (IDENTIFIER) Eo. F our input will be this: <input> <LP /> <IDENTIFIER /> <RP /> <Eo. F /> </input> 63

XML representation (cont. ) Instead of this input: IDENTIFIER + IDENTIFIER Eo. F our input will be this: <input> <IDENTIFIER /> <PLUS /> <IDENTIFIER /> <Eo. F /> </input> 64

XML representation (cont. ) Instead of this input: IDENTIFIER + (IDENTIFIER + IDENTIFIER) Eo. F our input will be this XML input: <input> <IDENTIFIER /> <PLUS /> <LP /> <IDENTIFIER /> <PLUS /> <IDENTIFIER /> <RP /> <Eo. F /> </input> 65

Parsing input → expression Eo. F expression → term rest_expression term → IDENTIFIER | parenthesized_expression → '(' expression ')' rest_expression → '+' expression | ε <input> <IDENTIFIER /> <Eo. F /> </input> Parser <output> <expression> <term> <IDENTIFIER/> </term> <rest_expression/> </expression> <Eo. F/> </output> 66

Parsing (cont. ) input → expression Eo. F expression → term rest_expression term → IDENTIFIER | parenthesized_expression → '(' expression ')' rest_expression → '+' expression | ε <input> <LP /> <IDENTIFIER /> <RP /> <Eo. F /> </input> Parser <output> <expression> <term> <parenthesized_expression> <LP/> <expression> <term> <IDENTIFIER/> </term> <rest_expression/> </expression> <RP/> </parenthesized_expression> </term> <rest_expression/> </expression> <Eo. F/> </output> 67

Parsing (cont. ) input → expression Eo. F expression → term rest_expression term → IDENTIFIER | parenthesized_expression → '(' expression ')' rest_expression → '+' expression | ε <input> <IDENTIFIER /> <PLUS /> <IDENTIFIER /> <Eo. F /> </input> Parser <output> <expression> <term> <IDENTIFIER/> </term> <rest_expression> <PLUS/> <expression> <term> <IDENTIFIER/> </term> <rest_expression/> </expression> </rest_expression> </expression> <Eo. F/> </output> 68

Parsing (cont. ) <input> <IDENTIFIER /> <PLUS /> <LP /> <IDENTIFIER /> <PLUS /> <IDENTIFIER /> <RP /> <Eo. F /> </input> input → expression Eo. F expression → term rest_expression term → IDENTIFIER | parenthesized_expression → '(' expression ')' rest_expression → '+' expression | ε Parser <output> <expression> <term> <IDENTIFIER/> </term> <rest_expression> <PLUS/> <expression> <term> <parenthesized_expression> <LP/> <expression> <term> <IDENTIFIER/> </term> <rest_expression> <PLUS/> <expression> <term> <IDENTIFIER/> </term> <rest_expression/> </expression> </rest_expression> </expression> <RP/> </parenthesized_expression> </term> <rest_expression/> </expression> </rest_expression> </expression> <Eo. F/> </output> 69

XSLT Implementation • I created an XSLT implementation of a recursive descent parser for the expression language. • If you would like to give my implementation a go, here is the XSLT program and a sample flat (linear) input XML document: • http: //www. xfront. com/parsing-techniques/recursive-descentparser/expression-parser. xsl • http: //www. xfront. com/parsing-techniques/recursive-descentparser/expression-test. xml 70

Limitations of Recursive Descent Parsers Recall that in a rule containing alternatives we tried the first alternative, if it failed we backtracked and tried the second alternative. Searching the alternatives is time-consuming. 71

Limitations (cont. ) Recursive descent parsers can’t handle left-recursive grammar rules. The parser goes into an infinite loop. Example: suppose the grammar has this rule: expression → expression '-' term That is a “left-recursive” rule: on the rule’s right-hand side it starts with the same symbol as on the left-hand side (i. e. , expression). The recursive descent routine for this rule is: expression() { return expression() and require(token(‘-’)) and require(term) } (infinite) recursion! 72

Limitations (cont. ) Suppose we add an array element as a term: term → IDENTIFIER | indexed_element | parenthesized_expression indexed_element → IDENTIFIER '[' expression ']' and create a recursive descent parser for the new grammar. The routine for indexed_element will never be tried: when the sequence IDENTIFIER '[' occurs in the input, the first alternative of term will succeed, consume the identifier, and leave the indigestible part '[' expression ']' in the input. 73

References – Great Books Modern Compiler Design (http: //www. amazon. com/Modern-Compiler -Design-Dick. Grune/dp/1461446988/ref=sr_1_1? s=books&ie=UTF 8&qid=141340845 8&sr=1 -1&keywords=modern+compiler+design) Parsing Techniques (http: //www. amazon. com/Parsing-Techniques. Practical-Monographs. Computer/dp/038720248 X/ref=sr_1_1? s=books&ie=UTF 8&qid=141340 8496&sr=1 -1&keywords=parsing+techniques) 74