A (Long) Introduction to ANTLR
Slides adapted from:
– ANTLR Reference Manual by Terence Parr, antlr.org/share/1084743321127/ANTLR_Reference_Manual.pdf
– ANTLR Tutorial by Ashley J. S. Mills, http://supportweb.cs.bham.ac.uk/docs/tutorials/docsystem/build/tutorials/antlrhome.html
– An Introduction to ANTLR by Terence Parr, http://www.cs.usfca.edu/~parrt/course/652/lectures/antlr.html
– An ANTLR Tutorial by Scott Stanchfield, javadude.com/articles/antlrtut/
2/22/2021, Hande Çelikkanat

ANTLR: ANother Tool for Language Recognition (or anti-LR?)
An LL(k) parser and translator generator that can create
– lexers
– parsers
– abstract syntax trees (ASTs)
You describe the language grammatically and in return receive a program that can recognize and translate that language.

Tasks Divided
• Lexical Analysis (scanning)
• Syntactic Analysis (parsing)
• Tree Generation
• Code Generation

Lexer
A source file is streamed to a lexer character by character by some kind of input interface.
The lexer groups characters into tokens that are meaningful to the parser. A "token" may be
– a keyword
– an identifier
– a symbol
– an operator
The lexer also removes comments and whitespace from the program, which are meaningless to the parser.
So it creates a stream of tokens, which the parser receives one by one.
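The grouping of characters into tokens can be sketched by hand in Java. This is only an illustration of what a lexer does, not the code ANTLR generates; the class name TinyLexer and the "TYPE:text" token strings are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical hand-written scanner, for illustration only.
class TinyLexer {
    // Groups characters into tokens; whitespace is recognized and thrown away.
    static List<String> tokenize(String src) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;                                   // skipped: never reaches the parser
            } else if (Character.isDigit(c)) {
                int start = i;
                while (i < src.length() && Character.isDigit(src.charAt(i))) i++;
                tokens.add("INT:" + src.substring(start, i));
            } else if (Character.isLetter(c)) {
                int start = i;
                while (i < src.length() && Character.isLetterOrDigit(src.charAt(i))) i++;
                tokens.add("ID:" + src.substring(start, i));
            } else {
                tokens.add("OP:" + c);                 // any other character is a one-character symbol
                i++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("x1 = 42 + y"));
    }
}
```

The parser would then consume this list one token at a time and never sees the whitespace.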

Parser
The parser organizes the tokens into the allowed sequences defined by the grammar of the language.
If the parser encounters a sequence of tokens that matches none of the allowed sequences, it issues an error.
A design choice is whether to try to recover from the error by making assumptions.
Parsers may either do syntax-directed translation on the fly, or convert the sequences of tokens into an Abstract Syntax Tree (AST).
An AST is a structure which
– keeps information in an easily traversable form (such as an operator at a node and its operands at the children of the node)
– ignores form-dependent superficial details
More on ASTs later...
The parser also generates one or more symbol tables, which contain information about the tokens it encounters.

What does a grammar file look like?
It is composed of rules.
ANTLR accepts three types of grammar specifications:
– parsers
– lexers
– tree-parsers (also called tree-walkers)
It uses LL(k) analysis for all three, so the grammar specifications are similar, and the generated lexers and parsers behave similarly.

Sample File
(taken from the ANTLR tutorial by Ashley J. S. Mills)

Sample File Divided (1/3)
• A grammar file may contain an arbitrary number of parsers, lexers, and tree-parsers
– a separate class file will be generated for each
– e.g., YourLexerClass.class, YourParserClass.class, YourTreeParserClass.class
• Header:
– a preamble that will be put on top of each of these classes
– an import, maybe?

Sample File Divided (2/3)
• Options
– file-wide
– charVocabulary = ''..'\377'; // defines the alphabet (used by complement and wildcard)
– k=2; // means two characters of lookahead
• Class specific:
    { ... header for parser class only ... }
    class MyParser extends Parser;
    options { ... parser options ... }
    { parser class members }
    parser rules

Sample File Divided (3/3)
• Rules in EBNF notation (taken from the ANTLR tutorial by Ashley J. S. Mills)
You simply list a set of lexical rules that match tokens.
The tool automatically generates code to map the next input character(s) to a rule likely to match: a big "switch" that routes recognition flow to the appropriate rule.

Symbols in ANTLR
(taken from the ANTLR reference manual)

Lexer
(taken from the ANTLR tutorial by Ashley J. S. Mills)
With one restriction:
• Rules defined within a lexer grammar must have a name beginning with an uppercase letter

Lexer Rules
You can define operators like:
    BECOMES  : ":=" ;
    COLON    : ':' ;
    SEMI     : ';' ;
    EQUALS   : '=' ;
    LBRACKET : '[' ;
    RBRACKET : ']' ;
    LPAREN   : '(' ;
    RPAREN   : ')' ;
    LT       : '<' ;
    LTE      : "<=" ;
    PLUS     : '+' ;
    MINUS    : '-' ;
    TIMES    : '*' ;
    DIV      : '/' ;
And then you can define a token class such as:
    OPS : (PLUS | MINUS | TIMES | DIV) ;

Actions
Blocks of source code (expressed in the target language) enclosed in curly braces.
An action is executed after the preceding production element has been recognized and before the recognition of the following element.
Typically used to generate output, construct trees, or modify a symbol table.
Its position dictates when it is executed relative to the surrounding grammar elements.
If an action is the first element of a production, it is executed before any other element in that production, but only if that production is predicted by the lookahead.
    rule_name
        : ( {init-action}:
            {action of 1st production} production_1
          | {action of 2nd production} production_2
          )?
        ;
The init-action would be executed regardless of what (if anything) matched in the optional subrule.
The init-actions are placed within the loops generated for subrules (...)+ and (...)*.

Tip: Skipping Tokens
White space has nothing to do in a grammar:
    WS : (' ' | '\n' | '\t')
         { $setType(Token.SKIP); }   // action
       ;
The action means: do not pass this token to the parser. Recognize it and then throw it away.
Same for comments ;)

Tip: Newline Stuff
The line number of the input is used for reporting errors.
It must be incremented by hand when the lexer encounters a newline:
    WS : ( ' '
         | '\t'
         | '\f'
         // handle newlines
         | ( "\r\n"   // DOS/Windows
           | '\r'     // Macintosh
           | '\n'     // Unix
           )
           // increment the line count
           { newline(); }            // action executed only in this case
         )
         { $setType(Token.SKIP); }
       ;

Parser
    class ExprParser extends Parser;
    expr  : mexpr ((PLUS|MINUS) mexpr)* ;
    mexpr : atom (STAR atom)* ;
    atom  : INT
          | LPAREN expr RPAREN
          ;
• Rules defined within a parser grammar must have a name beginning with a lowercase letter

Tip: Keywords and Literals (1/2)
Many languages have a general "identifier" lexical rule, and keywords that are special cases of the identifier pattern.
A typical identifier token may be defined as:
    ID : LETTER (LETTER | DIGIT)* ;
So how can ANTLR understand that "if" is not an identifier?
You put fixed keywords into a literals table, which is checked after each token is matched.
Any double-quoted string used in a parser is automatically entered into the literals table of the associated lexer.
    subprogramBody
        : (basicDecl)*
          (procedureDecl)*
          "begin"
          (statement)*
          "end" IDENT
        ;

Tip: Keywords and Literals (2/2)
The testLiterals option
By default, ANTLR will generate code in all lexer rules to test each token against the literals table.
However, you may suppress this code generation in the lexer by using a grammar option:
    class L extends Lexer;
    options { testLiterals=false; }
    ...
If you turn this option off for a lexer, you may re-enable it for specific rules:
    ID options { testLiterals=true; }
       : LETTER (LETTER | DIGIT)*
       ;
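The literals-table check can be pictured as a map lookup that runs after the identifier rule has matched. The sketch below is a hand-written approximation, not ANTLR's generated code; the class name, the token-type constants and the method name testLiterals are hypothetical and merely echo the option name.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the lexer's literals table.
class LiteralsTable {
    // Token type numbers invented for this example.
    static final int ID = 1, LITERAL_begin = 2, LITERAL_end = 3, LITERAL_if = 4;

    static final Map<String, Integer> literals = new HashMap<>();
    static {
        // Every double-quoted string used in the parser would be entered here.
        literals.put("begin", LITERAL_begin);
        literals.put("end", LITERAL_end);
        literals.put("if", LITERAL_if);
    }

    // Called after a rule such as ID has matched `text`:
    // a keyword overrides the rule's own token type.
    static int testLiterals(String text, int defaultType) {
        return literals.getOrDefault(text, defaultType);
    }

    public static void main(String[] args) {
        System.out.println(testLiterals("if", ID) == LITERAL_if);
        System.out.println(testLiterals("iffy", ID) == ID);
    }
}
```

Because the lookup uses the whole matched text, "iffy" stays an identifier even though it starts with a keyword.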

Tip: Token Object Creation
You will sometimes want to access information about the token being matched.
Label lexical rule references and obtain a Token object representing the text, token type, line number, etc. matched for that rule reference.
A lexer rule:
    INT : ('0'..'9')+ ;
A rule that references it through the label i:
    INDEX : '[' i:INT ']'
            { System.out.println(i.getText()); }
          ;

Tip: Syntactic / Semantic Predicates
There are other situations where you have to turn certain rules on and off depending on prior context or semantic information.
Use "predicates" to decide.

Syntactic Predicates
ANTLR (tree) parsers usually use only a single symbol of lookahead, which is normally not a problem as intermediate forms are explicitly designed to be easy to walk.
However, there is occasionally the need to distinguish between similar tree structures.
Syntactic predicates can be used to overcome the limitations of fixed lookahead.
For example, distinguishing between the unary and binary minus operator:
    expr : ( #(MINUS expr expr) )=> #( MINUS expr expr )
         | #( MINUS expr )
         ...
         ;
The order of evaluation is very important, as the second alternative is a "subset" of the first alternative.
Syntactic predicates are a form of selective backtracking; therefore, actions are turned off while evaluating a syntactic predicate so that actions do not have to be undone.

Semantic Predicates
Semantic predicates
– at the start of an alternative: decide whether or not to match
– in the middle of productions: throw exceptions when they evaluate to false
    stat : {isTypeName(LT(1))}? ID ID ";"    // declaration "type varName;"
         | ID "=" expr ";"                   // assignment
         ;
    decl : "var" ID ":" t:ID
           { isTypeName(t.getText()) }?      // used to throw an exception
         ;

Eg: Keeping State Information
Context-sensitive recognition example:
If you are matching tokens that separate rows of data such as "----", you probably only want to match this if the "begin table" sequence has been found.
    BEGIN_TABLE
        : '[' {this.inTable=true;}     // enter table context
        ;
    ROW_SEP
        : {this.inTable}? "----"       // semantic predicate
        ;
    END_TABLE
        : ']' {this.inTable=false;}    // exit table context
        ;
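The effect of the inTable flag can be sketched in plain Java. This is a hand-written approximation of the three rules above, not generated code; the class name TableLexer and the fallback token CHAR(...) are assumptions added so that the example handles all input.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical scanner that mirrors BEGIN_TABLE, ROW_SEP and END_TABLE.
class TableLexer {
    private boolean inTable = false;   // state shared by the rules

    List<String> tokenize(String src) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (c == '[') {
                inTable = true;                      // enter table context
                out.add("BEGIN_TABLE");
                i++;
            } else if (c == ']') {
                inTable = false;                     // exit table context
                out.add("END_TABLE");
                i++;
            } else if (inTable && src.startsWith("----", i)) {
                out.add("ROW_SEP");                  // the predicate holds only inside a table
                i += 4;
            } else {
                out.add("CHAR(" + c + ")");          // fallback invented for this sketch
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(new TableLexer().tokenize("--[----]"));
    }
}
```

The same four dashes produce a ROW_SEP inside the brackets and four ordinary characters outside them, which is the context sensitivity the predicate buys.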

The Java Code
The code to invoke the parser:
    import java.io.*;

    class Main {
        public static void main(String[] args) {
            try {
                // use DataInputStream to grab bytes
                MyLexer lexer = new MyLexer(new DataInputStream(System.in));
                MyParser parser = new MyParser(lexer);
                int x = parser.expr();
                System.out.println(x);
            } catch (Exception e) {
                System.err.println("exception: " + e);
            }
        }
    }

Running ANTLR
In Linux
    runantlr <antlr_file>.g
    javac *.java
    java Main
In Windows
Eclipse has a very easy-to-use plugin for ANTLR.
See http://antlreclipse.sourceforge.net/ for very detailed instructions.
The plugin will run ANTLR on the grammar file.

Expression Evaluation 1: Syntax-Directed Translation
To evaluate the expressions on the fly as the tokens come in, add actions to the parser:
    class ExprParser extends Parser;

    expr returns [int value=0]
    {int x;}
        : value=mexpr
          ( PLUS x=mexpr  {value += x;}
          | MINUS x=mexpr {value -= x;}
          )*
        ;

    mexpr returns [int value=0]
    {int x;}
        : value=atom ( STAR x=atom {value *= x;} )*
        ;

    atom returns [int value=0]
        : i:INT {value=Integer.parseInt(i.getText());}
        | LPAREN value=expr RPAREN
        ;
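To see what such a grammar amounts to, here is a hand-written recursive-descent evaluator with one Java method per rule. It is a sketch only: it reads characters directly instead of using a separate lexer, and the class name ExprEvaluator and the helper peek() are assumptions of this example, not ANTLR output.

```java
// Hypothetical hand-written counterpart of the expr / mexpr / atom rules.
class ExprEvaluator {
    private final String src;
    private int pos = 0;

    ExprEvaluator(String src) {
        this.src = src.replaceAll("\\s+", "");   // whitespace carries no meaning here
    }

    static int eval(String s) {
        ExprEvaluator p = new ExprEvaluator(s);
        int v = p.expr();
        if (p.pos != p.src.length())
            throw new IllegalArgumentException("unexpected input at " + p.pos);
        return v;
    }

    // One character of lookahead; '\0' marks end of input.
    private char peek() {
        return pos < src.length() ? src.charAt(pos) : '\0';
    }

    // expr : mexpr ( PLUS mexpr | MINUS mexpr )* ;
    int expr() {
        int value = mexpr();
        while (peek() == '+' || peek() == '-') {
            char op = src.charAt(pos++);
            int x = mexpr();
            if (op == '+') value += x; else value -= x;   // the "action"
        }
        return value;
    }

    // mexpr : atom ( STAR atom )* ;
    int mexpr() {
        int value = atom();
        while (peek() == '*') {
            pos++;
            value *= atom();
        }
        return value;
    }

    // atom : INT | LPAREN expr RPAREN ;
    int atom() {
        if (peek() == '(') {
            pos++;
            int value = expr();
            if (peek() != ')') throw new IllegalArgumentException("expected ')'");
            pos++;
            return value;
        }
        int start = pos;
        while (Character.isDigit(peek())) pos++;
        if (start == pos) throw new IllegalArgumentException("expected INT at " + pos);
        return Integer.parseInt(src.substring(start, pos));
    }

    public static void main(String[] args) {
        System.out.println(eval("3 + 4 * 5"));
    }
}
```

Operator precedence falls out of the rule nesting: mexpr is called from expr, so multiplication binds tighter than addition without any extra bookkeeping.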

Expression Evaluation 2: via AST Intermediate Form
A more powerful strategy than syntax-directed translation is to build an AST: an intermediate representation that holds all or most of the input symbols and encodes, in the structure of the data, the relationship between those tokens.
For this kind of tree, you will use a tree walker to compute the same values as before, but using a different strategy.
The utility of ASTs becomes clear when you must do multiple walks over the tree to figure out what to compute, or do tree rewrites that morph the tree towards another language.

Abstract Syntax Trees
Abstract Syntax Tree: like a parse tree, without unnecessary information.
Two-dimensional trees that can encode the structure of the input as well as the input symbols.
Either
– homogeneous: all objects of the same type, e.g., CommonAST in ANTLR, or
– heterogeneous: multiple types such as PlusNode, MultNode...
An AST for (3+4) might be represented as a PLUS node with the children 3 and 4.
No parentheses are included in the tree!

AST Construction
To get ANTLR to generate a useful AST:
– turn on the buildAST option
– add a few suffix operators
    class ExprParser extends Parser;
    options {
        buildAST=true;
    }
    expr  : mexpr ((PLUS^|MINUS^) mexpr)* ;
    mexpr : atom (STAR^ atom)* ;
    atom  : INT
          | LPAREN! expr RPAREN!
          ;
No changes in the lexer.

AST Operators
AST root operator
Normally ANTLR makes the first token it encounters the root of the tree.
We usually want to change this, e.g., for operators.
A token suffixed with the "^" root operator is forced to be the root of the current tree:
    expr : mexpr ((PLUS^|MINUS^) mexpr)* ;
AST exclude operator
Tokens / rule references suffixed with the "!" exclude operator are not included in the AST, e.g., for parentheses:
    atom : INT
         | LPAREN! expr RPAREN!
         ;

AST Parsing and Evaluation
Rule format is like #(A B C); which means "match a node of type A, and then descend into its list of children and match B and C".
This notation can be nested arbitrarily, using #(...) for child trees, e.g., #(A B #(C D));
    class ExprTreeParser extends TreeParser;

    expr returns [int r=0]
    { int a,b; }
        : #(PLUS a=expr b=expr)  {r = a+b;}
        | #(MINUS a=expr b=expr) {r = a-b;}
        | #(STAR a=expr b=expr)  {r = a*b;}
        | i:INT {r = Integer.parseInt(i.getText());}
        ;
Important: matches are sufficient, not exact. As long as the tree satisfies the pattern, a match is reported, regardless of how much is left unparsed.
For example, the pattern #( A B ) matches the tree #( A #(B C) D ).
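The walk performed by such a tree parser can be sketched in plain Java. The homogeneous node class below only imitates the idea of a CommonAST; CalcNode, TreeEvaluator and toLisp are hypothetical names for this example and not part of the ANTLR API.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical homogeneous AST node: one class for every kind of node.
class CalcNode {
    final String type;             // "PLUS", "MINUS", "STAR" or "INT"
    final String text;             // the token text, e.g. "+" or "42"
    final List<CalcNode> children;

    CalcNode(String type, String text, CalcNode... children) {
        this.type = type;
        this.text = text;
        this.children = Arrays.asList(children);
    }

    static CalcNode num(int v) {
        return new CalcNode("INT", Integer.toString(v));
    }

    // LISP-like rendering: root first, then the children.
    String toLisp() {
        if (children.isEmpty()) return text;
        StringBuilder sb = new StringBuilder("(" + text);
        for (CalcNode c : children) sb.append(' ').append(c.toLisp());
        return sb.append(")").toString();
    }
}

// One case per alternative of the tree-parser rule expr.
class TreeEvaluator {
    static int expr(CalcNode t) {
        switch (t.type) {
            case "PLUS":  return expr(t.children.get(0)) + expr(t.children.get(1));
            case "MINUS": return expr(t.children.get(0)) - expr(t.children.get(1));
            case "STAR":  return expr(t.children.get(0)) * expr(t.children.get(1));
            case "INT":   return Integer.parseInt(t.text);
            default: throw new IllegalArgumentException("unexpected node " + t.type);
        }
    }

    public static void main(String[] args) {
        // Tree for 3+4*5, built by hand instead of by a parser.
        CalcNode tree = new CalcNode("PLUS", "+",
                CalcNode.num(3),
                new CalcNode("STAR", "*", CalcNode.num(4), CalcNode.num(5)));
        System.out.println(tree.toLisp());
        System.out.println(expr(tree));
    }
}
```

Note that the tree carries no parentheses and no precedence rules; the nesting alone says that the multiplication happens first.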

in Java
The code to launch the parser and the tree walker:
    import java.io.*;
    import antlr.CommonAST;
    import antlr.collections.AST;

    class Calc {
        public static void main(String[] args) {
            try {
                CalcLexer lexer = new CalcLexer(new DataInputStream(System.in));
                CalcParser parser = new CalcParser(lexer);
                // Parse the input expression
                parser.expr();
                CommonAST t = (CommonAST)parser.getAST();
                // Print the resulting tree out in LISP notation
                System.out.println(t.toStringList());
                CalcTreeWalker walker = new CalcTreeWalker();
                // Traverse the tree created by the parser
                int r = walker.expr(t);
                System.out.println("value is " + r);
            } catch (Exception e) {
                System.err.println("exception: " + e);
            }
        }
    }

AST Construction by Hand
In some cases, you may want to transform a tree yourself, e.g., optimization of addition with zero:
    class CalcTreeWalker extends TreeParser;
    options {
        buildAST = true;   // "transform" mode
    }

    expr
        :! #(PLUS left:expr right:expr)   // '!' turns off auto transform
           {
               // x+0 = x
               if ( #right.getType()==INT &&
                    Integer.parseInt(#right.getText())==0 )
               {
                   #expr = #left;
               }
               // 0+x = x
               else if ( #left.getType()==INT &&
                         Integer.parseInt(#left.getText())==0 )
               {
                   #expr = #right;
               }
               // x+y
               else {
                   #expr = #(PLUS, left, right);
               }
           }
        | #(STAR expr expr)   // use auto transformation
        | i:INT
        ;
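The same rewrite can be sketched without ANTLR as a recursive function that returns a new tree. The binary node class TNode and the class ZeroAddSimplifier are hypothetical names for this illustration; as in the grammar, the children are transformed first and only PLUS nodes get special treatment.

```java
// Hypothetical binary tree node for the sketch; leaves have no children.
class TNode {
    final String type;        // "PLUS", "STAR" or "INT"
    final String text;
    final TNode left, right;  // both null for INT leaves

    TNode(String type, String text, TNode left, TNode right) {
        this.type = type;
        this.text = text;
        this.left = left;
        this.right = right;
    }

    static TNode num(int v) {
        return new TNode("INT", Integer.toString(v), null, null);
    }

    String toLisp() {
        return left == null ? text
                : "(" + text + " " + left.toLisp() + " " + right.toLisp() + ")";
    }
}

class ZeroAddSimplifier {
    static boolean isZero(TNode t) {
        return t.type.equals("INT") && Integer.parseInt(t.text) == 0;
    }

    // Returns a transformed copy of the tree with x+0 and 0+x replaced by x.
    static TNode simplify(TNode t) {
        if (t.type.equals("INT")) return t;           // i:INT, copied as is
        TNode l = simplify(t.left);                   // transform the children first
        TNode r = simplify(t.right);
        if (t.type.equals("PLUS")) {
            if (isZero(r)) return l;                  // x+0 = x
            if (isZero(l)) return r;                  // 0+x = x
        }
        return new TNode(t.type, t.text, l, r);       // x+y, or STAR: rebuild unchanged
    }

    public static void main(String[] args) {
        TNode tree = new TNode("STAR", "*",
                new TNode("PLUS", "+", TNode.num(0), TNode.num(4)),
                TNode.num(5));
        System.out.println(tree.toLisp());
        System.out.println(simplify(tree).toLisp());
    }
}
```

Working bottom-up matters: a sum such as (0+0)+x only collapses completely because the inner addition has already been reduced when the outer one is inspected.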

in Java
The code to launch the parser and the tree transformer is:
    import java.io.*;
    import antlr.CommonAST;
    import antlr.collections.AST;

    class Calc {
        public static void main(String[] args) {
            try {
                CalcLexer lexer = new CalcLexer(new DataInputStream(System.in));
                CalcParser parser = new CalcParser(lexer);
                // Parse the input expression
                parser.expr();
                CommonAST t = (CommonAST)parser.getAST();
                // Print the resulting tree out in LISP notation
                System.out.println(t.toLispString());
                CalcTreeWalker walker = new CalcTreeWalker();
                // Traverse the tree created by the parser
                walker.expr(t);
                // Get the result tree from the walker
                t = (CommonAST)walker.getAST();
                System.out.println(t.toLispString());
            } catch (Exception e) {
                System.err.println("exception: " + e);
            }
        }
    }

Left Recursion Solved
E → E+T | T written in ANTLR as
    expr : expr PLUS term
         | term
         ;
The generated code would call expr() again and again without ever consuming a token:
    expr() {
        expr(); match(PLUS); term();
    }
Eliminate the left recursion by
    E  → T E'
    E' → + T E' | ε
which results in:
    expr : term (PLUS term)* ;
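The rewritten rule maps onto a plain loop, which is why it terminates. The recognizer below is a hand-written sketch over a list of token strings; the class name LeftRecursionDemo and the choice of digit strings for term are assumptions made for illustration.

```java
import java.util.List;

// Hypothetical recognizer for: expr : term (PLUS term)* ;
class LeftRecursionDemo {
    private final List<String> tokens;
    private int pos = 0;

    LeftRecursionDemo(List<String> tokens) {
        this.tokens = tokens;
    }

    // A term is consumed BEFORE any further decision is made,
    // so every loop iteration makes progress through the input.
    boolean expr() {
        if (!term()) return false;
        while (pos < tokens.size() && tokens.get(pos).equals("+")) {
            pos++;                        // match(PLUS)
            if (!term()) return false;
        }
        return true;
    }

    // For this sketch a term is just an integer literal.
    boolean term() {
        if (pos < tokens.size() && tokens.get(pos).matches("\\d+")) {
            pos++;
            return true;
        }
        return false;
    }

    // True if the whole token list is a valid expr.
    static boolean accepts(List<String> tokens) {
        LeftRecursionDemo p = new LeftRecursionDemo(tokens);
        return p.expr() && p.pos == tokens.size();
    }

    public static void main(String[] args) {
        System.out.println(accepts(java.util.Arrays.asList("1", "+", "2", "+", "3")));
    }
}
```

A direct transcription of the left-recursive rule would instead start expr() by calling expr() with pos unchanged, and recurse until the stack overflows.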

Links
• ANTLR Reference Manual by Terence Parr, antlr.org/share/1084743321127/ANTLR_Reference_Manual.pdf
• ANTLR Tutorial by Ashley J. S. Mills, http://supportweb.cs.bham.ac.uk/docs/tutorials/docsystem/build/tutorials/antlr/antlrhome.html
• An Introduction to ANTLR by Terence Parr, http://www.cs.usfca.edu/~parrt/course/652/lectures/antlr.html
• An ANTLR Tutorial by Scott Stanchfield, javadude.com/articles/antlrtut/