Syntax: The Grammar of a Language


Topics to Know
• Difference between syntax and semantics.
• Four categories of languages.
• Regular expressions
  - used for pattern matching and extracting information
  - an essential part of many programming languages... memorize them!
• BNF and EBNF
  - used to describe the grammar of computer languages
  - can be used to automatically generate a parser for a language

Syntax and Semantics
• The syntax of a language defines the valid symbols and grammar.
  - Syntax defines the structure of a program, i.e., the form that each program unit and each statement must use.
• The semantics defines the meaning of the grammar elements.
• Lexical structure is the form of the lowest-level syntactic units (words or tokens) of a grammar.

Syntax and Semantics Compared
• Syntax: in Java, an assignment statement is:
    identifier = expression { operator expression } ;
• Semantics: an assignment statement must use compatible types, e.g.
    int n1, n2;
    n1 = 20*1024;   // OK, int_var = int_expression
    n2 = 3.50;      // illegal, incompatible types
• Lexical elements (tokens): "n2" "=" "3.50" ";"

Syntax and Semantics Compared
• Syntax: the form of a while statement is:
    while ( boolean_expression ) statement ;
• Semantics: when a thread of execution encounters a while statement, the boolean expression is tested. If the expression evaluates to true, the statement is executed and the process is repeated. If [not when] the expression evaluates to false, execution continues with the next statement.

How are they used?
Parts of a Compiler / Interpreter, with the data passed between stages:
    Program source code
      → Tokenizer (Lexical Analysis)        → token stream
      → Parser (Syntax Analysis)            → parse tree
      → Semantic Analysis                   → intermediate code
      → Optimization and Code Generation    → object code

Scanning and Parsing
    source file → input stream → Tokenizer → tokens → Parser → parse tree
Example:
    input stream:  sum = x1 + x2;
    tokens:        sum   =   x1   +   x2   ;
    parse tree:    assignment: sum = (+ x1 x2)

Scanners
• Recognize regular expressions.
• Implemented as finite automata (finite state machines).
• Typically contain a loop that cycles through characters, building tokens and associated values by repeated operations.
• The scanner may be integrated as a function in the parser.
• The parser calls the scanner to get the next token.
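The scanner loop described above can be sketched in Java with java.util.regex. This is only an illustration: the class name TokenizerSketch, the three token classes, and the patterns are assumptions made for this example, and a real scanner would normally be generated from a lexical specification or hand-coded as a finite automaton.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: split an input line into tokens with one combined pattern.
class TokenizerSketch {
    // White space is matched and discarded; groups 1-3 identify the token class.
    private static final Pattern TOKEN = Pattern.compile(
        "\\s+|([A-Za-z_][A-Za-z_0-9]*)|([0-9]+)|([-+*/=;()])");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        int pos = 0;
        while (pos < input.length()) {
            m.region(pos, input.length());      // continue where the last token ended
            if (!m.lookingAt()) {
                throw new IllegalArgumentException("unexpected character at " + pos);
            }
            if (m.group(1) != null) tokens.add("ID:" + m.group(1));
            else if (m.group(2) != null) tokens.add("NUMBER:" + m.group(2));
            else if (m.group(3) != null) tokens.add("OP:" + m.group(3));
            // otherwise the match was white space, which is skipped
            pos = m.end();
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("sum = x1 + x2;"));
    }
}
```

A parser would call something like tokenize (or a next-token variant of it) rather than reading characters itself.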

Parsers
• Recognize patterns defined by grammar rules.
• Implemented as pushdown automata.
• Convert a stream of tokens (supplied by the scanner) into a parse tree containing symbols defined in the grammar.
• Symbols are things like "assignment" and "expression".
• Parsing is more difficult than scanning.

Formal Languages
• Famed linguist Noam Chomsky introduced a formal classification of (human) languages. In terms of computing theory, his categories are:

    Hierarchy   Grammars            Languages                Minimal Automaton
    Type 0      unrestricted        Recursively enumerable   Turing machine
                                    Recursive                Decider
    Type 1      context-sensitive   Context-sensitive        Linear-bounded
    Type 2      context-free        Context-free             Pushdown
    Type 3      regular             Regular                  Deterministic finite

• Each class in this hierarchy is a subset of the class above it.
• A context-free grammar is a grammar where the syntax of each constituent is independent of the symbols that come before and after it.

Applying Formal Languages (1)
Tokenizer or Scanner (Lexical Analysis):
• The lexemes (tokens) in a computer language are a "regular grammar" (Type 3).
• Therefore, we can use the simplest grammar processor to make a tokenizer.
• Rules for tokens are defined using regular expressions. Examples:
    integer ::= "[+-]?[0-9]+"                      (actually, this is too simple)
    integer ::= "[+-]?(0[0-7]*|[1-9][0-9]*)"       (better)
    identifier ::= "[A-Za-z_][A-Za-z_0-9]*"
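To see why the first integer rule is "too simple", the rules can be tried directly in Java. The sketch below is an illustration only; the class name TokenRules and the helper is() are made up for this example. It shows that the simple rule accepts "089", which has a leading 0 but contains digits that are not octal, while the better rule rejects it.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: the token rules from the slide, compiled as Java patterns.
class TokenRules {
    static final Pattern SIMPLE_INT = Pattern.compile("[+-]?[0-9]+");
    static final Pattern BETTER_INT = Pattern.compile("[+-]?(0[0-7]*|[1-9][0-9]*)");
    static final Pattern IDENTIFIER = Pattern.compile("[A-Za-z_][A-Za-z_0-9]*");

    // True if the whole lexeme satisfies the rule.
    static boolean is(Pattern rule, String lexeme) {
        return rule.matcher(lexeme).matches();
    }

    public static void main(String[] args) {
        System.out.println(is(SIMPLE_INT, "089"));    // accepted by the simple rule
        System.out.println(is(BETTER_INT, "089"));    // rejected: 8 and 9 are not octal digits
        System.out.println(is(BETTER_INT, "0377"));   // octal constant
        System.out.println(is(IDENTIFIER, "_INIT"));  // valid identifier
    }
}
```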

Applying Formal Languages (2)
Syntax Analysis:
• The syntax of a computer language is a "context-free grammar"... almost.
• We can use a "type 2" grammar processor as a parser.
• Rules are defined in Backus-Naur Form (BNF) and Extended BNF. Example:
    expression ::= expression + term
                 | expression - term
                 | term
    term       ::= term * factor
                 | term / factor
                 | factor
    factor     ::= ( expression )
                 | NUMBER

Lexical Structure
• Lexemes are the smallest lexical unit of a language, grouped according to syntactic usage. Some types of lexemes in computer languages are:
    identifiers:        x, println, _INIT, ArrayList
    numeric constants:  0, 10000, 2.98E+6
    operators:          =, +, -, ++, +=, *, /
    separators:         [ ] ; : . , ( )
    string literals:    "hello there"
• A token is a string representing the value of a lexeme.
• Lexemes are recognized by the first phase of a translator, the scanner, which deals directly with the input. The scanner separates the input into tokens.
• Scanners are also called lexers.

Tokens
• Tokens are the strings of syntactic units.
• Example: what are the tokens in this statement?
    result = (sum - average)/count;
• Tokens:
    result    identifier
    =         assignment operator
    (         expression delimiter
    sum       identifier
    -         arithmetic operator
    average   identifier
    )         expression delimiter
    /         arithmetic operator
    count     identifier
    ;         semi-colon (statement delimiter)

C tokens
• Lexical structure of C, defined in The C Programming Language:
  "There are six classes of tokens: identifiers, keywords, constants, string literals, operators, and other separators. Blanks, horizontal and vertical tabs, newlines, formfeeds, and comments as described below (collectively, "white space") are ignored except as they separate tokens. Some white space is required to separate otherwise adjacent identifiers, keywords, and constants. If the input stream has been separated into tokens up to a given character, the next token is the longest string of characters that could constitute a token."
  [Kernighan and Ritchie, The C Programming Language, 2nd Ed., pp. 191-192.]
• C uses the principle of longest match (substring).

Principle of Longest Match
• A token should be the longest string that satisfies a rule for lexemes.
• Example: x2+=1.0
    tokens: "x2" "+=" "1.0"
    NOT:    "x" "2" "+" "=" "1" "." "0"
• What are the tokens in these inputs?
    x = y+1;
    x = y+=1;                          // tokenizer cannot rely on optional spaces
    x = y == 1;
    if ( x++ = 1 ) getFirstValue( );   // probably an error
• The tokenizer should not use context or look ahead more than 1 character. (FORTRAN is an exception.)
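A small Java sketch can make the longest-match rule concrete. One caveat when using java.util.regex for this: alternation picks the first alternative that matches, not the longest, so multi-character operators must be listed before their one-character prefixes. The class name LongestMatch and the token pattern are assumptions for this illustration, and characters that match no alternative (such as spaces) are silently skipped by find().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: longest-match tokenizing of simple statements.
class LongestMatch {
    // "++", "+=" and "==" come before the single-character operators so that
    // the longer operator wins; identifiers and numbers are greedy by default.
    private static final Pattern TOKEN = Pattern.compile(
        "[A-Za-z_]\\w*|\\d+(\\.\\d+)?|\\+\\+|\\+=|==|[-+*/=;()]");

    static List<String> tokens(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("x2+=1.0"));
        System.out.println(tokens("x = y == 1;"));
        System.out.println(tokens("if ( x++ = 1 )"));
    }
}
```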

Describing Lexical Structure
• In C, an identifier (i) begins with a letter or _ (ii) followed by any number of letters, '_', or digits.
• Compare these two examples: which way is simpler?
• Rules for C identifiers using EBNF:
    identifier ::= ( letter | _ ) { letter | _ | digit }
    letter ::= 'A' | 'B' | 'C' | 'D' | ... | 'Z' | 'a' | 'b' | 'c' | 'd' | ... | 'z'
    digit  ::= '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
• Rule for C identifiers using regular expressions:
    identifier ::= [A-Za-z_][A-Za-z_0-9]*
    identifier ::= [A-Za-z_]\w*        (shorthand: \w means [A-Za-z0-9_])

Rules for Regular Expressions
• Regular expressions:
    x               match an occurrence of x
    [abcd]          match any one of these characters
    [A-Z]           any one character from this range
    [A-Za-z_]       any letter or _
    x*   [a-z]*     0 or more occurrences of these
    x+   [a-z]+     1 or more occurrences of these
    x?   [a-z]?     exactly 0 or 1 occurrence
    x{5}            match exactly 5 times, same as xxxxx
    x{4,8}          match between 4 and 8 times
    .  (period)     match any one character
    .*              match anything!

Pattern Matching in Java
• java.util.regex.Matcher: matches Strings using regular expressions (patterns); can be used to extract the thing that is matched.
• java.util.regex.Pattern: defines objects for regular expressions.
• String.matches( regex ): tests whether the string value matches a regular expression.
• String.split( regex [, maxelements ] ): splits a string every place the regex is found.

Using String matches( )
Java's String matches( ) method and the Matcher class.
    "hi".matches("[a-z]+")       true
    "you2".matches("[a-z]+")     false
    "9".matches("\\d")           true, \d := [0-9]
    "4@com".matches(".+@.+")     true: anything @ anything
    "away".matches("a.*y")       true
    "ay".matches("a.+y")         false, .+ needs at least 1 character
    "123".matches("-?\\d+")      true
    "--12".matches("-?\\d+")     false
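One detail worth seeing in running code: String.matches( ) succeeds only if the entire string matches the pattern, whereas Matcher.find( ) looks for a match anywhere inside the string. The sketch below contrasts the two; the class name MatchVersusFind and its two helper methods are made up for this illustration.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: whole-string matching versus searching for a substring.
class MatchVersusFind {
    // True only if the regex matches the entire input.
    static boolean wholeMatch(String regex, String input) {
        return input.matches(regex);
    }

    // True if the regex matches some substring of the input.
    static boolean partialMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(wholeMatch("[a-z]+", "you2"));    // the trailing 2 prevents a whole match
        System.out.println(partialMatch("[a-z]+", "you2"));  // "you" is found inside the input
    }
}
```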

Using String split( )
Split a string into pieces anyplace white space is found:
    String s = "split me\tat white  space ";
    String[] word = s.split("\\s+");
    // word[0]="split"  word[1]="me"  word[2]="at"  word[3]="white"  word[4]="space"
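A runnable version of the split example follows, as a sketch. It also illustrates one behavior that is easy to miss: split( ) drops trailing empty strings, but a separator at the very start of the input produces an empty first element, so the helper trims the input first. The class name SplitDemo and the method words() are assumptions for this example.

```java
import java.util.Arrays;

// Hypothetical sketch: splitting a line into words on runs of white space.
class SplitDemo {
    static String[] words(String s) {
        // trim() first: leading white space would otherwise yield an empty first element
        return s.trim().split("\\s+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(words("  split me\tat  white space ")));
        System.out.println(Arrays.toString(" a b".split("\\s+")));  // note the empty first element
    }
}
```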

Character classes and special chars
    \n \t \r \f \e   newline, tab, return, formfeed, escape
    \x4E             character with hexadecimal value 4E
    \u1234           character with Unicode value 1234
    [^abcd]          any character NOT (^) in this set; ^ must be the FIRST CHAR after "["
    \d               any digit, same meaning as [0-9]
    \D               any non-digit, same as [^0-9]
    \s               any white space, [ \t\n\r\f\x0B]
    \S               any non-whitespace
    \w               any word character, [a-zA-Z0-9_]
    \W               any non-word character, [^a-zA-Z0-9_]
    ( ... )          pattern group

Examples using Character Classes
• Match a valid student ID for this class:
    4[6-9]\d{6}
• A C identifier: begins with a letter or underscore, followed by any number of letters, digits, or underscores:
    [A-Za-z_]\w*
• Hello in Thai:
    \u0E2A\u0E27\u0E31\u0E2A\u0E14\u0E35

Positional Matching
    ^     match the beginning of a line
    $     match the end of a line
    \b    match a word boundary
Match a student ID at the beginning of the string, as a word by itself:
    Pattern p = Pattern.compile("^\\d{8}\\b.*");
    p.matcher("47541234 Joe Hacker").matches();   // match
    p.matcher("123456789 too long").matches();    // no match
    p.matcher("My ID is 48541234").matches();     // no match

Regular Expression for the Time
• Write a pattern to match a time string of the form "hh:mm:ss am" or "hh:mm:ss PM"; use a 12-hour clock (am, pm). "am" and "pm" can be uppercase or lowercase.
  Examples: 3:38:09 am, 9:59 AM, 12:47:38 PM.
• Don't allow nonsense like 33:82:61 am:
    1?[0-9]:[0-5][0-9]:[0-5][0-9] +[AaPp][Mm]
• Another way, using a group and repetition:
    1?[0-9](:[0-5][0-9]){2} +[AaPp][Mm]
• If the seconds are optional (8:33 am) then use:
    1?[0-9](:[0-5][0-9]){1,2} +[AaPp][Mm]
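The time pattern can be checked with a few lines of Java. Note that the hour part 1?[0-9] from the slide still admits hours such as 0 and 13 to 19; the sketch below swaps in a stricter hour alternative, (1[0-2]|0?[1-9]), which is an addition for illustration and not the pattern from the slide. The class name TimePattern and the method isTime() are likewise made up for this example.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: 12-hour clock times with optional seconds.
class TimePattern {
    // Hour 1-12 (optional leading 0), minutes and optional seconds 00-59,
    // one or more spaces, then am/pm in either case.
    static final Pattern TIME =
        Pattern.compile("(1[0-2]|0?[1-9])(:[0-5][0-9]){1,2} +[AaPp][Mm]");

    static boolean isTime(String s) {
        return TIME.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"3:38:09 am", "9:59 AM", "12:47:38 PM", "33:82:61 am", "13:00 pm"}) {
            System.out.println(s + " -> " + isTime(s));
        }
    }
}
```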

More Matching in Java
• Examples using \w, \d:
    "hello".matches("\\w+")      true
    "you2!".matches("\\w+")      false
    "10900".matches("\\d{5}")    true: match zipcode
• String "split": String[] split( regular_expression )
    String s = "I like java";
    String[] w = s.split("\\s+");
  returns: w[0] = "I", w[1] = "like", w[2] = "java"

Pattern Extraction
Using a Matcher object, you can also find the position of a match, and extract the last matched string.
    import java.util.Scanner;
    import java.util.regex.*;
    ...
    Pattern pattern = Pattern.compile("4754\\d\\d");
    Scanner console = new Scanner(System.in);
    System.out.println("enter input line to scan: ");
    String text = console.nextLine();
    Matcher matcher = pattern.matcher(text);
    // find() locates each successive match anywhere in the input;
    // matches() would succeed only if the whole line matched the pattern.
    while (matcher.find()) {
        String id = matcher.group();   // get the matched string
        System.out.println("found " + id + " at position " + matcher.start());
    }

Groups and Pattern Extraction
    ( expression )   ( ) defines a group that you want to re-use or extract.
    \n               refers to the n-th group matched using ( ).
What strings does this pattern match?
    s.matches( "^(\\w+).*\\1$" )
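One way to explore the question is to run the pattern on a few strings. In the sketch below, \1 must repeat whatever group 1 captured, so the pattern accepts strings that end with the same word characters they begin with. The class name BackReference and the method repeated() are assumptions for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: a back-reference to the first captured group.
class BackReference {
    static final Pattern P = Pattern.compile("^(\\w+).*\\1$");

    // Returns the text captured by group 1, or null if the string does not match.
    static String repeated(String s) {
        Matcher m = P.matcher(s);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(repeated("abcXYZabc"));  // starts and ends with "abc"
        System.out.println(repeated("toot"));       // starts and ends with "t"
        System.out.println(repeated("hello"));      // no repeated prefix at the end
    }
}
```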

NOT Regular Expressions
• Don't confuse regular expressions with BNF / EBNF notation; regular expressions are not part of EBNF. Unfortunately, many sources use a hybrid of EBNF and regular expressions.
• In regular expressions, ( ... ) is used for grouping, not for a list of choices.
• Wrong: bogus notation (looks more like EBNF):
    identifier ::= ([A-Z][a-z]_)([A-Z][a-z][0-9]_)*
  Here ( ) is misused to mean "any one of these characters", as if (abc) meant "a" or "b" or "c".

Why Use Regular Expressions?
• Can be directly translated into source code for a tokenizer.
• Shorter than [E]BNF.
• Many applications and many languages use them.
How to Learn Regular Expressions
• Search the web: there are many tutorials.
• Core Java, p. 698-702. Java regular expressions are not exactly the same as the syntax in C, Perl, or flex.
• Java API for the "Pattern" class.
• A Perl or Flex book defines regex for those languages.

Practice
• Write a lexical description (using a regular expression) for:
  - base 10 constants (1234, -1234)
  - octal constants (0377)
  - hexadecimal constants (0x2FA84D, 0Xeeef)
  - floating point constants, with optional exponent
• Write a Java method to find and extract all the words in a string; a word is a sequence of letters delimited by a non-letter or the beginning/end of a line in the string.
• Write a Java method to remove /* ... */ comments from a string.

Types of lexemes
• Common lexemes (classes of tokens):
    identifiers:            x, println, _INIT, ArrayList
    numeric constants:      0, 10000, 2.98E+6
    assignment operators:   =, +=, -=, *=, /=, %=
    arithmetic operators:   *, /, +, -, %
    boolean operators:      &&, ||, ^, !
    separators:             [ ] ; : . , ( )
    string literals:        "hello there"
• Defining many lexemes makes the syntactic grammar more precise.
• Reserved words may be defined as a class, or simply treated as identifiers at the lexical level.

White space and comments
• "Internal" tokens of the scanner that are matched and discarded.
• Typical white space: newlines, tabs, spaces.
• Comments:
  - /* … */, // …   (C, C++, C#, Java)
  - # …             (Perl, Unix shells)
  - (* … *)         (Pascal, ML)
  - ; …             (Scheme)
• Comments are generally not nested.
• Comments and white space are ignored except that they serve as separators of tokens.

FORTRAN is an exception
• No reserved words:
    REAL IF, THEN
    IF (THEN .GT. 0) IF = THEN
• Compiler ignores spaces (spaces are removed before tokenizing):
    SUM=0.0
    S U M = 0. 0         (same as SUM=0.0)
    DO 99 I = 1, 10      (loop: for i := 1 to 10)
    DO 99 I = 1.10       (assignment: DO99I = 1.10)
• This means that the parser must "look ahead" to identify syntax.
• Lesson: don't remove white space before tokenizing.

Reserved words versus key words
• Pascal uses key words such as "integer" and "real":
    var n: integer;       { "integer" has special meaning in this context }
        integer: real;    { no special meaning here, you can redefine it }
    begin
        integer := 0.5;
• C uses reserved words, such as "int", "float", "return". Reserved words may not be redefined in a program:
    int n;
    float int;            /* Illegal! "int" is reserved. */
• Reserved words are easier than key words for the scanner to recognize, and easier for people to read.

Predefined identifiers
• Predefined identifiers have special meanings, but can be redefined (although they probably shouldn't be).
• Examples of predefined identifiers in Java: String, Object, System
  - in Java, you can define your own String or Object class
• Predefined identifiers are not reserved words.
• A reserved word cannot be used as the name of anything other than itself (i.e., it cannot be used as an identifier).

Java "keywords" (reserved words)
    abstract    continue    for           new          switch
    assert      default     if            package      synchronized
    boolean     do          goto          private      this
    break       double      implements    protected    throw
    byte        else        import        public       throws
    case        enum        instanceof    return       transient
    catch       extends     int           short        try
    char        final       interface     static       void
    class       finally     long          strictfp     volatile
    const       float       native        super        while
The Java Language Specification calls these "keywords".

Java reserved words (cont.)
• The words const and goto are reserved, even though they are not used in the Java language.
• Why (do you think) Java reserves "goto" and "const"?

Java reserved words (cont.)
• foreach: many languages have a "foreach" statement. In C#:
    double[] data = new double[100];
    ...
    foreach( double x in data ) { sum += x; }
• Java 5.0 defines a new syntax of "for" to do this:
    for( double x : data ) sum += x;
• Q: Why did Java use "for" instead of defining a "foreach"? What is the disadvantage of defining "foreach( var in collection )"?

Java reserved words (cont.)
• true and false aren't listed as "keywords" in the language spec. The spec calls them boolean literals (sect. 3.10.3). Similarly, null is the null literal (sect. 3.10.7).
• In actuality they are reserved words! These examples prove it (the compiler gives an error message):
    /* Encapsulate a constant :-) */
    public static boolean true( ) { return true; }
    public static boolean false( ) { return false; }

    /* If true and false are mere constants, we should
       be allowed to redefine the names locally. */
    public void illogical( ) {
        boolean false = (1==1);   // false = true
        boolean true = !false;    // true = false
    }

Categories of Grammar Rules
• Declarations or definitions:
    AttributeDeclaration ::= [ final ] [ static ] [ access ] datatype [ = expression ] { , datatype [ = expression ] } ;
    access ::= 'public' | 'protected' | 'private'
• Statements:
  - assignment, if, for, while, do_while
• Expressions, such as the examples in these slides.
• Structures such as statement blocks, methods, and entire classes:
    StatementBlock ::= '{' { Statement; } '}'

Parsing Algorithms (1)
• Broadly divided into LL and LR.
  - LL algorithms match input directly to left-side symbols, then choose a right-side production that matches the tokens. This is top-down parsing.
  - LR algorithms try to match tokens to the right-side productions, then replace groups of tokens with the left-side nonterminal. They continue until the entire input has been "reduced" to the start symbol.
  - LALR (look-ahead LR) algorithms are a special case of LR; they require a few restrictions on the LR case.
• Reference: Sebesta, sections 4.3 - 4.5.

Parsing Algorithms (2)
• Look-ahead:
  - algorithms must look at the next token(s) to decide between alternate productions for the current tokens
  - LALR(1) means LALR with 1 token look-ahead
  - LL(1) means LL with 1 token look-ahead
• LL algorithms are simpler and easier to visualize.
• LR algorithms are more powerful: they can parse some grammars that LL cannot, such as left-recursive grammars.
• yacc, bison, and CUP generate LALR(1) parsers.
• Recursive-descent is a useful LL algorithm that "every computer professional should know" [Louden].
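As a rough sketch of what recursive descent looks like, the Java class below evaluates arithmetic expressions with one method per nonterminal. Because an LL parser cannot use left-recursive rules directly, the grammar is rewritten here with EBNF repetition (expression ::= term { (+|-) term }, and likewise for term); that rewrite, the class name ExprParser, and the choice to compute a value instead of building a parse tree are all assumptions made for this illustration.

```java
// Hypothetical sketch: a recursive-descent evaluator with one-character look-ahead.
class ExprParser {
    private final String src;
    private int pos = 0;

    ExprParser(String src) { this.src = src; }

    static double evaluate(String text) {
        ExprParser p = new ExprParser(text);
        double value = p.expression();
        p.skipSpaces();
        if (p.pos != p.src.length()) {
            throw new IllegalArgumentException("unexpected input at " + p.pos);
        }
        return value;
    }

    // expression ::= term { ("+" | "-") term }
    private double expression() {
        double value = term();
        while (true) {
            if (accept('+')) value += term();
            else if (accept('-')) value -= term();
            else return value;
        }
    }

    // term ::= factor { ("*" | "/") factor }
    private double term() {
        double value = factor();
        while (true) {
            if (accept('*')) value *= factor();
            else if (accept('/')) value /= factor();
            else return value;
        }
    }

    // factor ::= "(" expression ")" | NUMBER
    private double factor() {
        if (accept('(')) {
            double value = expression();
            if (!accept(')')) throw new IllegalArgumentException("expected ) at " + pos);
            return value;
        }
        skipSpaces();
        int start = pos;
        while (pos < src.length()
               && (Character.isDigit(src.charAt(pos)) || src.charAt(pos) == '.')) {
            pos++;
        }
        if (start == pos) throw new IllegalArgumentException("expected number at " + pos);
        return Double.parseDouble(src.substring(start, pos));
    }

    // Consume c if it is the next non-blank character; otherwise leave the input alone.
    private boolean accept(char c) {
        skipSpaces();
        if (pos < src.length() && src.charAt(pos) == c) {
            pos++;
            return true;
        }
        return false;
    }

    private void skipSpaces() {
        while (pos < src.length() && Character.isWhitespace(src.charAt(pos))) pos++;
    }

    public static void main(String[] args) {
        System.out.println(evaluate("(2*3 + 5)*4 - 7"));
    }
}
```

The loops in expression() and term() keep the operators left-associative, which is what the left-recursive rules express in the original grammar.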

Top-down Parsing Example
For the input:
    z = (2*x + 5)*y - 7;
tokens:
    ID = ( NUMBER * ID + NUMBER ) * ID - NUMBER ;
Grammar rules (as before):
    assignment => ID = expression ;
    expression => expression + term
                | expression - term
                | term
    term       => term * factor
                | term / factor
                | factor
    factor     => ( expression )
                | ID
                | NUMBER

Top-down Parsing Example (2)
The top-down parser tries to match input to left sides. (In the original slides, the part matched to the input so far was shown in green.)
    input:  ID = ( NUMBER * ID + NUMBER ) * ID - NUMBER ;

    assignment
    ID = expression - term ;
    ID = term * factor - term ;
    ID = factor * factor - term ;
    ID = ( expression + term ) * factor - term ;
    ID = ( term * factor + term ) * factor - term ;
    ID = ( factor * ID + factor ) * factor - term ;
    ID = ( NUMBER * ID + NUMBER ) * ID - factor ;
    ID = ( NUMBER * ID + NUMBER ) * ID - NUMBER ;

Top-down Parsing Example (3)
• Problem in the example: we had to look ahead many tokens in order to know which production to use.
• This isn't necessary provided that we know the grammar is parsable using LL (top-down) methods. There are conditions on the grammar that we can test to verify this. (See: The Parsing Problem.)
• Later we will study the recursive-descent algorithm, which does top-down parsing with minimal look-ahead.

The Parsing Problem

The Parsing Problem
• Top-down parsers must decide which production to use based on the current symbol, and perhaps "peeking" at the next symbol (or two...).
• Predictive parser: a parser that bases its actions on the next available token (called single-symbol look-ahead).
• Two conditions are necessary: [see Louden, p. 108-110]
• The first condition is the ability to choose between multiple alternatives, such as: A → α1 | α2 | ... | αn
  - define First(α) = the set of all tokens that can be the first token of any string derived from α
  - then a predictive parser can be used for rule A if First(αi) ∩ First(αj) is empty for every pair i ≠ j.

The Parsing Problem (cont.)
• The second condition is the ability of the parser to detect the presence of an optional element, such as A → [ b ]. Can the parser detect for certain whether b is present?
• Example: list → expr [ list ]. How do we know that list isn't part of expr?
  - define Follow(α) = the set of all tokens that can follow the nonterminal α in some production. Use a special symbol ($) to represent the end of input if α can be the end of input.
  - Example, for the expression grammar written with repetition instead of left recursion (term → factor { (* | /) factor }): Follow( factor ) = { +, -, *, /, ), $ } while Follow( term ) = { +, -, ), $ }
  - then a predictive parser can detect the presence of the optional symbol b if First( b ) ∩ Follow( b ) is empty.

Review and Thought Questions

Lexics vs. Syntax vs. Semantics
• The division between lexical and syntactic structure is not fixed:
  - a number can be a token or defined by a grammar rule.
• Implementation can often decide:
  - scanners are faster
  - parsers are more flexible
  - error checking of number format as a regex is simpler
• The division between syntax and semantics is not fixed:
  - we could define separate rules for IntegerNumber and FloatingPtNumber, IntegerTerm, FloatingPtTerm, ... in order to specify which mixed-mode operations are allowed
  - or specify this as part of the semantics

Numbers: Scan or Parse?
We can construct numbers from digits using the scanner or the parser. Which is easier / better?
• Scanner: define numbers as tokens:
    number : -?\d+
• Parser: grammar rules define numbers (digits are tokens):
    number         => '-' unsignednumber | unsignednumber
    unsignednumber => unsignednumber digit | digit
    digit          => 0|1|2|3|4|5|6|7|8|9

Is Java 'Class' grammar context-free?
• A class may have static and instance attributes.
• An inner class or local class has the same syntax as a top-level class, but:
  - it may not contain static members (except static constants)
  - an inner class may access the outer class using OuterClass.this
  - a local class cannot be "public"
• Does this mean the syntax for a class depends on context?

Alternative operator notation
• Some languages use prefix notation: the operator comes first.
    expr => + expr expr | * expr expr | NUMBER
• Examples:
    * + 2 3 4    means (2 + 3) * 4
    + 2 * 3 4    means 2 + (3 * 4)
• Using prefix notation, we don't have to worry about the precedence of different operators in BNF rules!
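A short Java sketch shows how little machinery prefix notation needs: one recursive method with no precedence handling at all. The sketch assumes binary + and * over integers with tokens separated by white space; the class name PrefixEval and both eval methods are made up for this illustration.

```java
import java.util.Arrays;
import java.util.Iterator;

// Hypothetical sketch: evaluating prefix expressions by recursive descent.
class PrefixEval {
    // expr ::= "+" expr expr | "*" expr expr | NUMBER
    static int eval(Iterator<String> tokens) {
        String t = tokens.next();
        if (t.equals("+")) {
            int left = eval(tokens);
            int right = eval(tokens);
            return left + right;
        }
        if (t.equals("*")) {
            int left = eval(tokens);
            int right = eval(tokens);
            return left * right;
        }
        return Integer.parseInt(t);   // anything else must be a NUMBER
    }

    static int eval(String text) {
        return eval(Arrays.asList(text.trim().split("\\s+")).iterator());
    }

    public static void main(String[] args) {
        System.out.println(eval("* + 2 3 4"));   // (2 + 3) * 4
        System.out.println(eval("+ 2 * 3 4"));   // 2 + (3 * 4)
    }
}
```

The first token always decides which production applies, so a single token of look-ahead is enough.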