Lecture 4: Lexical Analysis & Chomsky Hierarchy (Revised based on Tucker’s slides) 10/20/2021 Lecture 4: Lexical Analysis & Chomsky Grammar 1
Revisit Expression Grammar
• Let us consider the following grammar for Assignment:
    Assignment -> ID '=' Exp
    Exp -> Exp + Term | Term
    Term -> Term * Integer | ID
    Integer -> 0 | 1 | … | 9 | 0 Integer | 1 Integer | … | 9 Integer
    ID -> a | b | … | z | a ID | b ID | … | z ID
• Build a parse tree for: abc = x + y
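As a minimal sketch of the parse tree requested above, the tree for abc = x + y can be written as nested Python tuples of the form (nonterminal, child, ...). The tuple encoding and the helper names `ident`, `leaves`, and `show` are assumptions made for illustration and are not part of the slides; the sketch relies on the production Term -> ID.

```python
# Parse tree for "abc = x + y" as nested tuples: (nonterminal, child, ...).
# Leaves are plain strings (terminal symbols).

def ident(name):
    """Build the ID subtree using ID -> letter | letter ID."""
    if len(name) == 1:
        return ("ID", name)
    return ("ID", name[0], ident(name[1:]))

tree = (
    "Assignment",
    ident("abc"),
    "=",
    ("Exp",                                # Exp -> Exp + Term
        ("Exp", ("Term", ident("x"))),     # Exp -> Term, Term -> ID
        "+",
        ("Term", ident("y"))),             # Term -> ID
)

def leaves(node):
    """Return the terminals of the tree from left to right (its yield)."""
    if isinstance(node, str):
        return [node]
    result = []
    for child in node[1:]:                 # node[0] is the nonterminal label
        result.extend(leaves(child))
    return result

def show(node, depth=0):
    """Print the tree with one node per line, indented by depth."""
    if isinstance(node, str):
        print("  " * depth + repr(node))
        return
    print("  " * depth + node[0])
    for child in node[1:]:
        show(child, depth + 1)

show(tree)
print("".join(leaves(tree)))
```

Reading the leaves from left to right gives back the input string, which is a quick check that the tree really derives it.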
Levels of Syntax
• Lexical syntax = all the basic symbols of the language (names, values, operators, etc.)
• Concrete syntax = rules for writing expressions, statements and programs.
• Abstract syntax = internal representation of the program, favoring content over form.
So: Expression Grammar
• For the following grammar:
  Concrete syntax:
    Assignment -> ID '=' Exp
    Exp -> Exp + Term | Term
    Term -> Term * Integer | Integer
  Lexical syntax:
    Integer -> 0 | 1 | … | 9 | 0 Integer | 1 Integer | … | 9 Integer
    ID -> a | b | … | z | a ID | b ID | … | z ID
Regular Grammar
• Simplest; least powerful
• Concentrates on the lexical syntax
• Right regular grammar: with ω ∈ T*, B ∈ N, a ∈ T, every production has one of the forms
    A → aB
    A → a
    A → ε   (ε is the empty string)
  or, more generally, A → ωB or A → ω
Regular Grammar
• Left regular grammar: with ω ∈ T*, B ∈ N, a ∈ T, every production has one of the forms
    A → Ba
    A → a
    A → ε
• A regular grammar is either a left regular grammar or a right regular grammar
• Consider the following grammar:
    S → aA
    A → Sb
    S → ε
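To see what the grammar S → aA, A → Sb, S → ε generates, a brute-force sketch can expand the leftmost nonterminal of every sentential form up to a length bound. The function name `generate`, the `RULES` table, and the bound are illustrative assumptions.

```python
from collections import deque

# Productions of the grammar S -> aA, A -> Sb, S -> ε (ε is the empty string).
# Uppercase letters are nonterminals, lowercase letters are terminals.
RULES = {"S": ["aA", ""], "A": ["Sb"]}

def generate(max_len):
    """Return all terminal strings of length <= max_len derivable from S."""
    found = set()
    seen = {"S"}
    queue = deque(["S"])
    while queue:
        form = queue.popleft()
        # Locate the leftmost nonterminal, if any.
        pos = next((i for i, c in enumerate(form) if c.isupper()), None)
        if pos is None:
            found.add(form)            # only terminals left: a sentence
            continue
        for rhs in RULES[form[pos]]:
            new = form[:pos] + rhs + form[pos + 1:]
            # Prune forms whose terminals already exceed the bound.
            terminals = sum(1 for c in new if c.islower())
            if terminals <= max_len and new not in seen:
                seen.add(new)
                queue.append(new)
    return sorted(found, key=len)

print(generate(6))
```

Every string produced has the form aⁿbⁿ. The grammar mixes a right-linear production (S → aA) with a left-linear one (A → Sb), so it is neither a right regular nor a left regular grammar.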
Regular Grammars
• Equivalent to:
  – Regular expression
  – Finite-state automaton
• Used in construction of tokenizers
• Less powerful than context-free grammars
• Not regular languages: { aⁿ bⁿ | n ≥ 1 } and { aᵐ bⁿ | 1 ≤ m ≤ n }, i.e., a regular grammar cannot balance: ( ), { }, begin end
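The contrast above can be sketched in code, assuming a hand-written transition table (the state names and the helpers `dfa_accepts` and `is_anbn` are hypothetical). A finite-state automaton with a fixed set of states can recognize a⁺b⁺, but checking aⁿbⁿ needs a count that grows with the input, which no fixed set of states can hold.

```python
# Table-driven DFA for the regular language a+b+
# (one or more a's followed by one or more b's).
DFA = {
    ("start", "a"): "as",
    ("as", "a"): "as",
    ("as", "b"): "bs",
    ("bs", "b"): "bs",
}
ACCEPTING = {"bs"}

def dfa_accepts(text):
    """Run the DFA; a missing table entry means the input is rejected."""
    state = "start"
    for ch in text:
        state = DFA.get((state, ch))
        if state is None:
            return False
    return state in ACCEPTING

def is_anbn(text):
    """Recognize { a^n b^n | n >= 1 } using an unbounded counter."""
    count = 0
    i = 0
    while i < len(text) and text[i] == "a":
        count += 1
        i += 1
    if count == 0:
        return False
    while i < len(text) and text[i] == "b":
        count -= 1
        i += 1
    # Accept only if the whole input was consumed and the counts balance.
    return i == len(text) and count == 0

for sample in ["aabb", "aaabb", "ba"]:
    print(sample, dfa_accepts(sample), is_anbn(sample))
```

The counter in `is_anbn` plays the role that a stack plays in a pushdown automaton, which is why balancing constructs belong to the context-free level rather than the regular one.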
Compilers & Interpreters
[Figure: compiler phases. Source Program → Lexical Analyzer → Tokens → Syntactic Analyzer → Abstract Syntax → Semantic Analyzer → Intermediate Code (IC) → Code Optimizer → Intermediate Code (IC) → Code Generator → Machine Code. Annotations: "Finds syntax errors", "semantic errors".]
Lexical Analysis
• Purpose: transform program representation
• Input: printable ASCII characters
• Output: tokens (terminals T)
• Discard: whitespace, comments
• Defn: A token is a logically cohesive sequence of characters representing a single symbol. A token corresponds to a terminal symbol in the CFG.
Example Tokens
• Identifiers
• Literals: 123, 5.67, 'x', true
• Keywords: bool char ...
• Operators: + - * / ...
• Punctuation: ; , ( ) { }
Other Sequences
• Whitespace: space tab
• Comments: // any-char* end-of-line
• End-of-line
• End-of-file
All of the above languages can be defined by a CFG.
Regular Expressions
RegExpr       Meaning
x             a character x
\x            an escaped character, e.g., \n
{ name }      a reference to a name
M | N         M or N
M N           M followed by N
M*            zero or more occurrences of M
RegExpr       Meaning
M+            one or more occurrences of M
M?            zero or one occurrence of M
[aeiou]       the set of vowels
[0-9]         the set of digits
.             any single character
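Most of this notation carries over directly to regular-expression libraries. The sketch below uses Python's `re` module, where `re.fullmatch` checks that the whole string matches; the sample patterns and strings are illustrative choices, not taken from the slides. The `{ name }` reference form is a scanner-generator convention with no direct equivalent in `re`, so it is left out.

```python
import re

# (pattern, input, expected result of a whole-string match)
examples = [
    (r"a|b",     "b",    True),   # M | N : M or N
    (r"ab",      "ab",   True),   # M N   : M followed by N
    (r"a*",      "",     True),   # M*    : zero or more occurrences
    (r"a+",      "",     False),  # M+    : one or more occurrences
    (r"ab?",     "a",    True),   # M?    : zero or one occurrence
    (r"[aeiou]", "e",    True),   # the set of vowels
    (r"[0-9]+",  "2021", True),   # the set of digits, repeated
    (r".",       "x",    True),   # any single character (except newline in re)
    (r"\.",      "x",    False),  # escaped character: a literal dot only
]

for pattern, text, expected in examples:
    matched = re.fullmatch(pattern, text) is not None
    assert matched == expected
    print(f"{pattern!r:12} {text!r:8} {matched}")
```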
Clite Lexical Syntax
Category      Definition
anyChar       [ -~]
Letter        [a-zA-Z]
Digit         [0-9]
Whitespace    [ \t]
Eol           \n
Eof           \004
Category      Definition
Identifier    {Letter}({Letter} | {Digit})*
integerLit    {Digit}+
floatLit      {Digit}+\.{Digit}+
charLit       '{anyChar}'
Category      Definition
Operator      = | || | && | == | != | < | <= | > | >= | + | - | * | / | ! | [ | ]
Separator     : | . | { | } | ( | )
Comment       // ({anyChar} | {Whitespace})* {Eol}
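The Clite categories can be combined into a small hand-written tokenizer. This is a minimal sketch using Python's `re` module with named groups, not the scanner a generator would produce. The names `TOKEN_SPEC`, `MASTER`, and `tokenize` are hypothetical, whitespace and end-of-line are merged into one discarded category for brevity, keywords are not distinguished from identifiers, and the separator set additionally includes `;` as an assumption.

```python
import re

# (category, pattern) pairs; order matters because earlier alternatives win.
TOKEN_SPEC = [
    ("Comment",    r"//[^\n]*"),                # up to end of line; discarded
    ("Whitespace", r"[ \t\n]+"),                # discarded (includes Eol here)
    ("floatLit",   r"[0-9]+\.[0-9]+"),          # must precede integerLit
    ("integerLit", r"[0-9]+"),
    ("charLit",    r"'[ -~]'"),
    ("Identifier", r"[a-zA-Z][a-zA-Z0-9]*"),
    ("Operator",   r"\|\||&&|==|!=|<=|>=|[=<>+\-*/!\[\]]"),  # longest first
    ("Separator",  r"[;:.{}()]"),               # ';' is an assumption here
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    """Return (category, lexeme) pairs, skipping whitespace and comments."""
    tokens = []
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise SyntaxError(f"illegal character {source[pos]!r} at offset {pos}")
        if m.lastgroup not in ("Whitespace", "Comment"):
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

print(tokenize("x = y1 + 3.5; // sum\n"))
```

Listing `floatLit` before `integerLit` and the two-character operators before the one-character ones is a design choice that approximates the longest-match rule used by scanner generators.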
Generators
• Input: usually regular expression
• Output: table (slow), code
• C/C++: Lex, Flex
• Java: JLex
Chomsky Hierarchy
• Regular grammar (least powerful)
• Context-free grammar (BNF)
• Context-sensitive grammar
• Unrestricted grammar
Context-free Grammars
• BNF is a stylized form of CFG
• Equivalent to a pushdown automaton
• For a wide class of unambiguous CFGs, there are table-driven, linear-time parsers
Context-Sensitive Grammars
• Production: α → β, with |α| ≤ |β|
• α, β ∈ (N ∪ T)*
• i.e., the left-hand side can be composed of strings of terminals and nonterminals
Regular Expression Exercise
• Describe the languages denoted by the following REs
  – 0(0|1)*0
  – ((ε|0)1*)*
  – (0|1)*0(0|1)
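One way to explore these exercises is to enumerate short binary strings and list those each expression accepts. The sketch below does this with Python's `re.fullmatch`; it assumes the symbol missing from the second expression is ε, written here as an empty alternative, and the names `PATTERNS` and `accepted` are hypothetical.

```python
import re
from itertools import product

# Slide notation mapped to Python regular expressions.
PATTERNS = {
    "0(0|1)*0":     r"0(0|1)*0",
    "((ε|0)1*)*":   r"((|0)1*)*",     # ε written as an empty alternative
    "(0|1)*0(0|1)": r"(0|1)*0(0|1)",
}

def accepted(pattern, max_len):
    """List all binary strings of length <= max_len that fully match pattern."""
    hits = []
    for n in range(max_len + 1):
        for bits in product("01", repeat=n):
            s = "".join(bits)
            if re.fullmatch(pattern, s):
                hits.append(s)
    return hits

for label, pattern in PATTERNS.items():
    print(label, accepted(pattern, 3))
```

Looking at which strings appear, and which are missing, usually suggests the description in words; increasing the length bound helps confirm it.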
Regular Expression Exercise
• Consider a small language using only the letters “z” and “o”, and the slash char “/”. A comment in this language starts with “/o” and ends after the very next “o/”. Comments do not nest. The regular notations that can be used include A|B, A*, and A+.
  Valid: /o/zzzz/oo/, /ozz/oz////o/
  Invalid: /o/, /ozzzooo/zzzo/
Regular Expression Exercise
• Consider a small language using only the letters “z” and “o”, and the slash char “/”. A comment in this language starts with “/o” and ends after the very next “o/”. Comments do not nest. The regular notations that can be used include A|B, A*, and A+.
  Candidate regular expressions:
  – /o(o*z|/)*o+/
  – /o(o|z|/)*o/
  – /o/*(o*z/*)*o+/
  – /o(/|oz|oo)*o+/
  – /o(/*o*z)*/*o+/
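A quick way to screen the candidates is to run each one against the valid and invalid sample comments given for this exercise. The sketch below uses Python's `re.fullmatch`; the names `CANDIDATES`, `VALID`, `INVALID`, and `consistent` are hypothetical.

```python
import re

CANDIDATES = [
    r"/o(o*z|/)*o+/",
    r"/o(o|z|/)*o/",
    r"/o/*(o*z/*)*o+/",
    r"/o(/|oz|oo)*o+/",
    r"/o(/*o*z)*/*o+/",
]
VALID = ["/o/zzzz/oo/", "/ozz/oz////o/"]
INVALID = ["/o/", "/ozzzooo/zzzo/"]

def consistent(pattern):
    """True if pattern accepts every valid sample and rejects every invalid one."""
    accepts_valid = all(re.fullmatch(pattern, s) for s in VALID)
    rejects_invalid = not any(re.fullmatch(pattern, s) for s in INVALID)
    return accepts_valid and rejects_invalid

for pattern in CANDIDATES:
    print(pattern, consistent(pattern))
```

Agreeing with four samples is necessary but not sufficient: a candidate that survives this screen still has to be argued correct for every string, for example by checking that the comment body can never contain “o/”.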
Regular Expression Exercise
• Write a regular expression for all strings of 0’s and 1’s that satisfy each of the following conditions:
  – all binary strings except the empty string
  – contains at least three 1s
  – does not contain the substring 110
  – length is at least 1 and at most 3
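Candidate answers to these exercises can be checked by brute force against a plain-Python statement of each condition. The expressions below are one possible set of answers, offered as an assumption rather than the intended solutions, and the names `EXERCISES` and `agrees` are hypothetical.

```python
import re
from itertools import product

# (condition, candidate regular expression, reference predicate)
EXERCISES = [
    ("non-empty",         r"(0|1)+",                      lambda s: len(s) >= 1),
    ("at least three 1s", r"(0|1)*1(0|1)*1(0|1)*1(0|1)*", lambda s: s.count("1") >= 3),
    ("no substring 110",  r"(0|10)*1*",                   lambda s: "110" not in s),
    ("length 1 to 3",     r"(0|1)(0|1)?(0|1)?",           lambda s: 1 <= len(s) <= 3),
]

def agrees(pattern, predicate, max_len=8):
    """Check pattern against predicate on every binary string up to max_len."""
    for n in range(max_len + 1):
        for bits in product("01", repeat=n):
            s = "".join(bits)
            if (re.fullmatch(pattern, s) is not None) != predicate(s):
                return False
    return True

for name, pattern, predicate in EXERCISES:
    print(name, pattern, agrees(pattern, predicate))
```

For the third condition, the idea behind `(0|10)*1*` is that once two consecutive 1s appear, no 0 may follow, so a run of two or more 1s can only occur at the end of the string.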