




























CSCI 3370: Principles of Programming Languages
Syntax
Dr. Vamsi Paruchuri
University of Central Arkansas
vparuchuri@uca.edu

Introduction
• A language description consists of a syntax specification and a semantics specification.
  ◦ Syntax describes what programs look like.
  ◦ Semantics describes what they mean.

How translators work
• Lexical Analysis: A scanner reads the source program (a character string) and breaks it into a sequence of tokens.
• Parsing: A parser takes the sequence of tokens and generates a parse tree representing the structure of the program, based on the context-free grammar (CFG) for the language.
• Semantic Analysis: The parse tree is examined and checked for aspects of the language that a CFG cannot express, for example variable scoping.
• Optimization (optional): Various optimization techniques are applied, for example loop unrolling.
• Code Generation: The internal representation of the program is translated into assembly or machine code.
• Linking & Loading: The last phase links in any external libraries or separately compiled code, resolves relative addresses, and loads the program into memory.

Syntax
• Lexical structure: the structure of the “words”, or tokens, of a language
  ◦ Described by regular expressions
• Syntactic structure: the structure of the “sentences” of a language
  ◦ Described by context-free grammars

Typical token categories
• Reserved words, sometimes called keywords, such as if and while
• Literals or constants, such as 42 (a numeric literal) or "hello" (a string literal)
• Special symbols, such as “;”, “<=”, or “+”
• Identifiers, such as x24, monthly_balance, or putchar
A lexeme is the smallest syntactic unit of a language.

Lexical analysis
• Breaks a single string into a sequence of tokens. For example, the string “x=y/z /* divide, then add */ +100;” is translated into 8 tokens: “x”, “=”, “y”, “/”, “z”, “+”, “100”, “;”. Comments are usually discarded in this process.
• Identifies the token category of each lexeme. For example, “=” and “+” are symbols, and “x” and “y” are identifiers.
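The scanner behavior described above can be sketched with Python's `re` module. This is a minimal illustration, not the scanner of any particular compiler: the names `TOKEN_SPEC` and `tokenize`, the category labels, and the exact patterns are assumptions chosen to cover the slide's example string.

```python
import re

# Hypothetical token categories; order matters because the first
# alternative that matches at a position wins.
TOKEN_SPEC = [
    ("COMMENT", r"/\*.*?\*/"),        # C-style comment, discarded
    ("NUMBER", r"\d+"),               # numeric literal
    ("IDENTIFIER", r"[A-Za-z_]\w*"),  # names such as x or monthly_balance
    ("SYMBOL", r"<=|[=+\-*/;]"),      # a few special symbols
    ("SKIP", r"\s+"),                 # whitespace, discarded
]
MASTER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC),
    re.DOTALL,
)

def tokenize(source):
    """Return a list of (category, lexeme) pairs for `source`."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected character {source[pos]!r} at {pos}")
        if match.lastgroup not in ("COMMENT", "SKIP"):
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens

if __name__ == "__main__":
    for category, lexeme in tokenize("x=y/z /* divide, then add */ +100;"):
        print(category, lexeme)
```

Reserved words are not handled separately in this sketch; a common approach is to scan them as identifiers and then look them up in a keyword table.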

Concepts and Notations
• Set: An unordered collection of unique elements
  ◦ S1 = { a, b, c }, S2 = { 0, 1, …, 19 }
  ◦ union: S1 ∪ S2 = { a, b, c, 0, 1, …, 19 }
  ◦ subset: S1 ⊂ U
  ◦ empty set: ∅
• Alphabet: A finite set of symbols
  ◦ Examples: Character sets: ASCII, ISO-8859-1, Unicode
    S1 = { a, b }, S2 = { Spring, Summer, Autumn, Winter }
• String: A sequence of zero or more symbols from an alphabet
  ◦ The empty string: ε

Concepts and Notations
• Language: A set of strings over an alphabet
  ◦ Also known as a formal language; it may not bear any resemblance to a natural language, but could model a subset of one.
  ◦ The language comprising all strings over an alphabet Σ is written as: Σ*
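Since Σ* is infinite for any nonempty alphabet, it can only be enumerated up to some length bound. The sketch below does that for a small alphabet; the function name `strings_up_to` and the length bound are illustrative assumptions.

```python
from itertools import product

def strings_up_to(alphabet, max_len):
    """Yield every string over `alphabet` of length 0..max_len, shortest first."""
    symbols = sorted(alphabet)
    for length in range(max_len + 1):
        # product(..., repeat=0) yields one empty tuple: the empty string.
        for combo in product(symbols, repeat=length):
            yield "".join(combo)

if __name__ == "__main__":
    print(list(strings_up_to({"a", "b"}, 2)))
```

Any language over { a, b } is a subset of the full enumeration, so a language can be modeled as a predicate that filters these strings.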

Regular Expressions
• A regular expression defines a regular language over an alphabet Σ:
  ◦ ∅ is a regular language: //
  ◦ Any symbol from Σ is a regular language: for Σ = { a, b, c }, /a/ /b/ /c/
  ◦ The concatenation of two regular languages is a regular language: for Σ = { a, b, c }, /ab/ /bc/ /ca/

Regular Expressions
• Regular language (continued):
  ◦ The union (or disjunction) of two regular languages is a regular language: for Σ = { a, b, c }, /ab|bc/ /ca|bb/
  ◦ The Kleene closure (denoted by the Kleene star: *) of a regular language is a regular language: for Σ = { a, b, c }, /a*/ /(ab|ca)*/
  ◦ Parentheses group a sub-language to override operator precedence

Regular expressions (cont)
• Regular expressions
  ◦ describe patterns of characters
  ◦ have three basic operations
• Concatenation: RS denotes the set { αβ | α in R and β in S }.
• Selection: R|S denotes the set union of R and S.
• Repetition (Kleene star): R* denotes the set of all strings that can be made by concatenating zero or more strings in R.
• For example, (ab|c)* = {ε, "ab", "c", "abab", "abc", "cab", "cc", "ababab", ...}.
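The three operations can be tried out directly on finite sets of strings. In this sketch the function names `concat`, `union`, and `star` are assumptions, and `star` is truncated at a fixed number of repetitions because the true Kleene closure is infinite.

```python
def concat(R, S):
    """RS: every string of R followed by every string of S."""
    return {r + s for r in R for s in S}

def union(R, S):
    """R|S: the set union of R and S."""
    return R | S

def star(R, max_repeats):
    """Approximate R*: concatenations of 0..max_repeats strings from R."""
    result = {""}   # zero repetitions give the empty string
    layer = {""}
    for _ in range(max_repeats):
        layer = concat(layer, R)
        result |= layer
    return result

if __name__ == "__main__":
    # (ab|c)* truncated at two repetitions
    print(sorted(star(union({"ab"}, {"c"}), 2), key=lambda s: (len(s), s)))
```

With `max_repeats=2` the result contains ε, "ab", "c", "abab", "abc", "cab", and "cc"; "ababab" appears once three repetitions are allowed.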

Examples of RE
• Priority: we assume that Kleene star > concatenation > selection
• Therefore,
  ◦ a|b* denotes {ε, a, b, bb, bbb, ...}
  ◦ (a|b)* denotes the set of all strings consisting of 0 or more repetitions of either a or b symbols, including the empty string
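Python's `re` module follows the same precedence, so the difference between the two expressions can be checked directly. The helper name `matches` is an assumption; `re.fullmatch` is used so that the whole string must belong to the language.

```python
import re

def matches(pattern, text):
    """True if the entire `text` is in the language of `pattern`."""
    return re.fullmatch(pattern, text) is not None

if __name__ == "__main__":
    for text in ["", "a", "b", "bbb", "ab", "abba"]:
        print(repr(text), matches(r"a|b*", text), matches(r"(a|b)*", text))
```

The string "ab" is rejected by a|b* (it is neither a single a nor a run of b's) but accepted by (a|b)*.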

Extended regular expressions
• Have more operations and special characters
  ◦ + means one or more repetitions: "ba+" matches "ba", "baaa", and so on
  ◦ ? means zero or one repetition: "ba?" matches "b" or "ba"
• Examples:
  ◦ "(h|c)+at" matches "hat", "cat", "hhat", "chat", "hcat", "ccchat", etc.
  ◦ "(h|c)?at" matches "hat", "cat", and "at"
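The extended operators add convenience rather than expressive power: R+ can be rewritten as RR*, and R? as a choice between R and the empty string. The sketch below checks this by brute force on all short strings; the function name `same_language` and the length bound are assumptions, and agreement up to a bound is evidence, not a proof.

```python
import re
from itertools import product

def same_language(pattern1, pattern2, alphabet, max_len):
    """Compare two patterns on every string over `alphabet` up to `max_len`."""
    regex1, regex2 = re.compile(pattern1), re.compile(pattern2)
    for length in range(max_len + 1):
        for chars in product(alphabet, repeat=length):
            text = "".join(chars)
            if bool(regex1.fullmatch(text)) != bool(regex2.fullmatch(text)):
                return False
    return True

if __name__ == "__main__":
    # R+ rewritten as RR*
    print(same_language(r"(h|c)+at", r"(h|c)(h|c)*at", "hcat", 6))
    # R? rewritten as (R|ε); the empty alternative plays the role of ε
    print(same_language(r"(h|c)?at", r"(h|c|)at", "hcat", 6))
```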

More Examples
• 10*0
• (a)*(c | d)
• (a | b)(c | d)
• (abc)*
• a*b*
• (b)*a(b)*a(b)*
• (a | b)*aaa(a | b)*
• (a | b)*(a | b)
  ◦ Also represented as (a, b)*
• a(a | b)* ∪ b(a | b)*
• (b*ab*a)*b*
• Non-regular language: { aⁿbⁿ | n > 0 }

Formal Definition of Languages
• Recognizers
  ◦ A recognition device reads input strings of the language and decides whether the input strings belong to the language
  ◦ Example: the syntax analysis part of a compiler
• Generators
  ◦ A device that generates sentences of a language
  ◦ One can determine if the syntax of a particular sentence is correct by comparing it to the structure of the generator

Formal Methods of Describing Syntax
• Backus-Naur Form and Context-Free Grammars
  ◦ Most widely known method for describing programming language syntax
• Extended BNF
  ◦ Improves readability and writability of BNF
• Grammars and Recognizers

BNF and Context-Free Grammars
• Context-Free Grammars
  ◦ Developed by Noam Chomsky in the mid-1950s
  ◦ Language generators, meant to describe the syntax of natural languages
  ◦ Define a class of languages called context-free languages

Backus-Naur Form (BNF)
• Backus-Naur Form (1959)
  ◦ Invented by John Backus to describe Algol 58
  ◦ BNF is equivalent to context-free grammars
  ◦ BNF is a metalanguage used to describe another language
  ◦ In BNF, abstractions are used to represent classes of syntactic structures; they act like syntactic variables (also called nonterminal symbols)
    <assign> → <var> = <expression>

BNF Fundamentals
• Non-terminals: BNF abstractions
• Terminals: lexemes and tokens
• Grammar: a collection of rules
• Examples of BNF rules:
    <ident_list> → identifier | identifier, <ident_list>
    <if_stmt> → if <logic_expr> then <stmt>
NOTE
• A lexeme is the lowest-level syntactic unit of a language (e.g., *, sum, begin)
• A token is a category of lexemes (e.g., identifier)

BNF Rules
• A rule has a left-hand side (LHS) and a right-hand side (RHS), and consists of terminal and nonterminal symbols
• A grammar is a finite nonempty set of rules
• An abstraction (or nonterminal symbol) can have more than one RHS
    <stmt> → <single_stmt> | begin <stmt_list> end

Describing Lists
• Syntactic lists are described using recursion
    <ident_list> → ident | ident, <ident_list>
• A derivation is a repeated application of rules, starting with the start symbol and ending with a sentence (all terminal symbols)
• We use the terms rule and production interchangeably
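The recursion in the <ident_list> rule maps naturally onto a recursive function. The recognizer below is a sketch under two assumptions: the input is already a list of tokens, and Python's `str.isidentifier` stands in for the ident token category. The function name `is_ident_list` is hypothetical.

```python
def is_ident_list(tokens):
    """Recognize <ident_list> → ident | ident , <ident_list>."""
    # Both alternatives start with an ident.
    if not tokens or not tokens[0].isidentifier():
        return False
    # First alternative: a single ident.
    if len(tokens) == 1:
        return True
    # Second alternative: ident, a comma, then another <ident_list>.
    return tokens[1] == "," and is_ident_list(tokens[2:])

if __name__ == "__main__":
    print(is_ident_list(["a", ",", "b", ",", "c"]))
    print(is_ident_list(["a", ","]))
```

Each recursive call corresponds to one application of the second alternative in a derivation, and the base case corresponds to the first alternative.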

Definition
L(G), the language generated by a grammar G, is the set of all strings in T* (strings over the terminal symbols T) that can be obtained by a derivation starting from the start symbol S.

Example of CFG
• Terminals: the symbols a and b
• Non-terminals: the symbol S
• Start symbol: S
• Productions: S → aSb | ε
• What language does this CFG generate?
Another example
• Terminals: the symbols ( and )
• Non-terminals: the symbol A
• Productions: A → ( ) | (A) | AA
• What language does this CFG generate?

What strings do we have?
• Examples of strings generated by this grammar are:
  ◦ ( )
  ◦ ( )( )
  ◦ (( ))( )
• The corresponding derivations are:
  ◦ A => ( )
  ◦ A => AA => ( )( )
  ◦ A => AA => (A)( ) => (( ))( )
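One way to explore what the two grammars S → aSb | ε and A → ( ) | (A) | AA generate is to enumerate leftmost derivations up to a length bound. The sketch below is an assumption-laden illustration: nonterminals are single characters, the names `generate`, `AN_BN`, and `PARENS` are hypothetical, A is taken as the start symbol of the second grammar, and the pruning rule is only adequate for small grammars like these two.

```python
from collections import deque

def generate(productions, start, max_len):
    """Collect the terminal strings of length <= max_len derivable from `start`.

    `productions` maps a one-character nonterminal to its right-hand sides.
    """
    results = set()
    seen = {start}
    queue = deque([start])
    while queue:
        form = queue.popleft()
        # Find the leftmost nonterminal of the sentential form, if any.
        idx = next((i for i, sym in enumerate(form) if sym in productions), None)
        if idx is None:
            results.add(form)  # only terminals remain: a sentence
            continue
        for rhs in productions[form[idx]]:
            new_form = form[:idx] + rhs + form[idx + 1:]
            terminals = sum(1 for sym in new_form if sym not in productions)
            # Pruning heuristic (an assumption): drop forms that already have
            # too many terminals or are longer than max_len + 1 symbols.
            # This may miss strings for grammars with many ε-productions.
            if (terminals <= max_len and len(new_form) <= max_len + 1
                    and new_form not in seen):
                seen.add(new_form)
                queue.append(new_form)
    return results

AN_BN = {"S": ["aSb", ""]}            # S → aSb | ε
PARENS = {"A": ["()", "(A)", "AA"]}   # A → ( ) | (A) | AA

if __name__ == "__main__":
    print(sorted(generate(AN_BN, "S", 6), key=len))
    print(sorted(generate(PARENS, "A", 4)))
```

The enumeration suggests that the first grammar generates strings of n a's followed by n b's (including the empty string), and that the second generates nonempty strings of balanced parentheses.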

Yet another example
• The language of arithmetic expressions with plus and times operations
  ◦ Terminals: the symbols + and * and (the representation of) the natural numbers
  ◦ Non-terminals: the symbols Exp and Num
  ◦ Start symbol: Exp
  ◦ Productions:
      Exp → Num | Exp + Exp | Exp * Exp
      Num → 0 | 1 | 2 | 3 | ...

What strings do we have?
• Examples of strings generated by this grammar are:
  ◦ 2
  ◦ 2 + 3
  ◦ 2 + 3 * 5
• The corresponding derivations are:
  ◦ Exp => Num => 2
  ◦ Exp => Exp + Exp => Num + Num => 2 + 3
  ◦ Exp => Exp + Exp => Num + Exp * Exp => Num + Num * Num => 2 + 3 * 5

A better one
• Productions:
  ◦ Exp → Num | Exp + Exp | Exp * Exp
  ◦ Num → Num Digit | Digit
  ◦ Digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
• There is still a problem
  ◦ We don’t normally think “00123” is the right way of writing a natural number.

Sample solution
  Exp → GoodNum | Exp + Exp | Exp * Exp
  GoodNum → NZDigit Num | Digit
  Num → Num Digit | Digit
  Digit → NZDigit | 0
  NZDigit → 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
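The GoodNum rules describe a regular language, so they can be checked with a regular expression. The translation below is my own reading of the rules, not part of the slides: a nonzero digit followed by one or more digits (NZDigit Num), or a single digit (Digit). The names `GOOD_NUM` and `is_good_num` are hypothetical.

```python
import re

# GoodNum → NZDigit Num | Digit, with Num being one or more digits.
GOOD_NUM = re.compile(r"[1-9][0-9]+|[0-9]")

def is_good_num(text):
    """True if `text` is a natural number without leading zeros."""
    return GOOD_NUM.fullmatch(text) is not None

if __name__ == "__main__":
    for text in ["0", "7", "10", "123", "00123", "007", ""]:
        print(repr(text), is_good_num(text))
```

A single "0" is still accepted through the Digit alternative, while "00123" is rejected because a multi-digit number must start with a nonzero digit.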