Programming Languages (formal languages)
– How to describe them?
– How to use them? (machine and human)
Syntax: describes the structure of programs
  Grammars: ambiguous (sometimes); solution: use only unambiguous grammars
Semantics: describes the meaning of programs
  Textbooks, manuals: confusing (always); solution: denotational semantics (for nuts only)
11/21/2020 IT 327

English Grammar
The man hit the ball. (subject, verb, object)
The man saw the girl with a telescope. (subject, verb, object)
The purpose of grammar:
– Old-fashioned view: to tell whether a sentence is valid.
– Chomsky: to have a device that can generate all valid sentences of the target language (from a root).

Noam Chomsky (1928 - )
Syntactic Structures (1957)
Generative Grammar: a valid sentence is generated from the root according to some fixed rules (the grammar).

Directly copied from Chomsky’s book

Syntactic Structures: "the man hit the ball"
S
  NP
    T: the
    N: man
  VP
    Verb: hit
    NP
      T: the
      N: ball

A generative grammar in Syntactic Structures
S → NP + VP (S is the root)
NP → T + N
VP → Verb + NP
T → the | a
N → man | ball | car | ...
Verb → hit | take | took | run | ran | ...
Non-terminal symbols: S, NP, VP, T, N, Verb
Terminal symbols: the, a, man, ball, car, hit, take, took, run, ran, ...

Backus-Naur Form, BNF

Grammar 1
<S> ::= <NP> <VP>
<NP> ::= <T> <N>
<VP> ::= <V> <NP>
<T> ::= the
<N> ::= man | car | ball
<V> ::= hit | took

Grammar 2
<S> ::= <NP> <V> <NP>
<NP> ::= <A> <N>
<V> ::= loves | hates | eats
<A> ::= a | the
<N> ::= dog | cat | rat

Grammars 1 and 2 combined
<S> ::= <NP> <V> <NP> | <NP> <VP>
<NP> ::= <T> <N> | <A> <N>
<VP> ::= <V> <NP>
<V> ::= loves | hates | eats | hit | took
<A> ::= a | the
<T> ::= the
<N> ::= dog | cat | rat | man | car | ball

Derivation: the sequence of steps that generates a sentence

Grammar 1
<S> ::= <NP> <VP>
<NP> ::= <T> <N>
<VP> ::= <V> <NP>
<T> ::= the
<N> ::= man | car | ball
<V> ::= hit | took

Derivation of "the man hit the ball":
<S>
<NP> <VP>
<T> <N> <VP>
the man <V> <NP>
the man hit <T> <N>
the man hit the ball

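The derivation on this slide can be replayed mechanically. The following is a minimal Python sketch, assuming Grammar 1 is stored as a dictionary; `leftmost_derivation` and its `choices` parameter are hypothetical names introduced here for illustration. It rewrites one non-terminal per step, so it shows a few more intermediate forms than the slide does.

```python
# Grammar 1 from the slide: non-terminal -> list of alternatives.
GRAMMAR_1 = {
    "<S>": [["<NP>", "<VP>"]],
    "<NP>": [["<T>", "<N>"]],
    "<VP>": [["<V>", "<NP>"]],
    "<T>": [["the"]],
    "<N>": [["man"], ["car"], ["ball"]],
    "<V>": [["hit"], ["took"]],
}


def leftmost_derivation(grammar, start, choices):
    """Rewrite the leftmost non-terminal once per entry in `choices`.

    Each entry is the index of the alternative to use at that step.
    Returns the list of sentential forms, starting with the root.
    """
    form = [start]
    steps = [" ".join(form)]
    for choice in choices:
        # Locate the leftmost symbol that is still a non-terminal.
        index = next(i for i, sym in enumerate(form) if sym in grammar)
        form = form[:index] + grammar[form[index]][choice] + form[index + 1:]
        steps.append(" ".join(form))
    return steps


# Pick "man" (index 0) for the first <N> and "ball" (index 2) for the second.
steps = leftmost_derivation(GRAMMAR_1, "<S>", [0, 0, 0, 0, 0, 0, 0, 0, 2])
for step in steps:
    print(step)
```

Changing the entries of `choices` generates other sentences of the same grammar, which is the sense in which the grammar is generative.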
Parse (American Heritage Dict.): v. To break (a sentence) down into its component parts of speech with an explanation of the form, function, and syntactical relationship of each part.
the dog loves the cat
the loves dog the cat ×
loves the dog the cat ×

A Parse Tree

Grammar
<S> ::= <NP> <V> <NP>
<NP> ::= <A> <N>
<V> ::= loves | hates | eats
<A> ::= a | the
<N> ::= dog | cat | rat

Parse tree for "the dog loves the cat":
<S>
  <NP>
    <A>: the
    <N>: dog
  <V>: loves
  <NP>
    <A>: the
    <N>: cat

"the loves dog the cat" doesn’t have a parse tree.

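To make "has a parse tree" concrete, here is a small Python sketch of a recognizer for the grammar on this slide. The function names and the tuple representation of trees are assumptions made for illustration; the slides do not prescribe an implementation.

```python
# Terminal symbols of the grammar on this slide.
A = {"a", "the"}
N = {"dog", "cat", "rat"}
V = {"loves", "hates", "eats"}


def parse_np(words, pos):
    """<NP> ::= <A> <N>. Returns (tree, next position) or None."""
    if pos + 1 < len(words) and words[pos] in A and words[pos + 1] in N:
        tree = ("<NP>", ("<A>", words[pos]), ("<N>", words[pos + 1]))
        return tree, pos + 2
    return None


def parse_sentence(sentence):
    """<S> ::= <NP> <V> <NP>. Returns a parse tree, or None if there is none."""
    words = sentence.split()
    first = parse_np(words, 0)
    if first is None:
        return None
    subject, pos = first
    if pos >= len(words) or words[pos] not in V:
        return None
    verb = ("<V>", words[pos])
    second = parse_np(words, pos + 1)
    # The second noun phrase must consume the rest of the sentence.
    if second is None or second[1] != len(words):
        return None
    return ("<S>", subject, verb, second[0])


print(parse_sentence("the dog loves the cat"))
print(parse_sentence("the loves dog the cat"))
```

The first call returns a nested tuple mirroring the parse tree; the second returns `None` because no rule lets a verb follow the article.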
Restrictions on Grammars
Diagram in terms of the sizes of the sets of restrictions (from fewest to most restrictions):
  Unrestricted Grammars (type-0)
  Context-Sensitive Grammars (type-1)
  Context-Free Grammars (type-2)
  Right/Left Linear Grammars (type-3)
Why do context-sensitive grammars have fewer restrictions than context-free grammars?

Chomsky Hierarchy
Diagram in terms of the sizes of the language families (smallest family first):
  Regular languages, i.e. those described by regular expressions (type-3)
  Context-free languages (type-2)
  Context-sensitive languages (type-1)
  Recursively enumerable languages (type-0)

A Grammar for Arithmetic Expressions
<exp> ::= <exp> + <exp> | <exp> * <exp> | ( <exp> ) | a | b | c

Example: ((a+b)*c). Is this expression valid?
<exp>
( <exp> )
( <exp> * <exp> )
(( <exp> ) * <exp> )
(( <exp> + <exp> ) * <exp> )
((a+b) * <exp> )
((a+b)*c)
Yes.

A Parse Tree for ((a+b)*c)
<exp>
  (
  <exp>
    <exp>
      (
      <exp>
        <exp>: a
        +
        <exp>: b
      )
    *
    <exp>: c
  )

Parse Trees for a+b*c?

Tree 1 (root is +):
<exp>
  <exp>: a
  +
  <exp>
    <exp>: b
    *
    <exp>: c

Tree 2 (root is *):
<exp>
  <exp>
    <exp>: a
    +
    <exp>: b
  *
  <exp>: c

What is the meaning of a+b*c?

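The ambiguity can be checked by brute force. The sketch below, in Python, enumerates every parse tree that the grammar `<exp> ::= <exp> + <exp> | <exp> * <exp> | ( <exp> ) | a | b | c` assigns to a string of single-character tokens. The function name `parse_trees` and the tuple encoding of trees are assumptions for illustration only.

```python
from functools import lru_cache

OPERANDS = {"a", "b", "c"}
OPERATORS = {"+", "*"}


def parse_trees(text):
    """Return every parse tree of `text` under the ambiguous <exp> grammar."""
    tokens = tuple(text)

    @lru_cache(maxsize=None)
    def trees(lo, hi):
        # All trees that derive exactly tokens[lo:hi].
        found = []
        if hi - lo == 1 and tokens[lo] in OPERANDS:
            found.append(tokens[lo])
        if hi - lo >= 3 and tokens[lo] == "(" and tokens[hi - 1] == ")":
            found.extend(("()", inner) for inner in trees(lo + 1, hi - 1))
        # Try every operator position as the root of the tree.
        for mid in range(lo + 1, hi - 1):
            if tokens[mid] in OPERATORS:
                for left in trees(lo, mid):
                    for right in trees(mid + 1, hi):
                        found.append((tokens[mid], left, right))
        return tuple(found)

    return list(trees(0, len(tokens)))


for tree in parse_trees("a+b*c"):
    print(tree)
print(len(parse_trees("((a+b)*c)")))
```

For `a+b*c` the function finds two trees, one rooted at `+` and one rooted at `*`, matching the two trees on this slide; the fully parenthesized `((a+b)*c)` has only one.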
Grammars in BNF (Backus-Naur Form)
• A BNF grammar consists of four parts:
  – The finite set of tokens (terminal symbols)
  – The finite set of non-terminal symbols
  – The start symbol
  – The finite set of production rules
<S> ::= <NP> <VP>
<NP> ::= <T> <N>
<VP> ::= <V> <NP>
<T> ::= the
<N> ::= man | ball
<V> ::= hit | took

Constructing Grammars
Examples of <var-dec>:
  float a;
  boolean a, b, c;
  int a, b;
• Use divide and conquer to simplify the job:
  1. Data types, variable names (identifiers)
  2. Statements, programs
• (One variable, one type; but making sure of that is not the grammar’s job.)

Using divide and conquer
<var-dec> ::= <type-name> <declarator-list> ;
<type-name> ::= boolean | byte | short | int | long | char | float | double (primitive type names)
<declarator-list> ::= <declarator> | <declarator> , <declarator-list>
<declarator> ::= <variable-name> | <variable-name> = <expr>

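Each rule of the `<var-dec>` grammar maps onto a few lines of a hand-written parser. The Python sketch below assumes the input has already been split into tokens, treats `<variable-name>` as any identifier that is not a type name, and, because the slide does not define `<expr>`, assumes an initializer is a single token. `parse_var_dec` is a hypothetical name.

```python
TYPE_NAMES = {"boolean", "byte", "short", "int", "long", "char", "float", "double"}


def parse_var_dec(tokens):
    """Parse <var-dec>; return (type name, [(variable, initializer or None)])."""
    pos = 0

    def expect(condition, what):
        if not condition:
            raise ValueError(f"expected {what} at token {pos}")

    # <var-dec> ::= <type-name> <declarator-list> ;
    expect(pos < len(tokens) and tokens[pos] in TYPE_NAMES, "a type name")
    type_name = tokens[pos]
    pos += 1
    declarators = []
    while True:
        # <declarator> ::= <variable-name> | <variable-name> = <expr>
        is_name = (pos < len(tokens) and tokens[pos].isidentifier()
                   and tokens[pos] not in TYPE_NAMES)
        expect(is_name, "a variable name")
        name, init = tokens[pos], None
        pos += 1
        if pos < len(tokens) and tokens[pos] == "=":
            expect(pos + 1 < len(tokens), "an initializer")
            init = tokens[pos + 1]  # assumption: <expr> is one token
            pos += 2
        declarators.append((name, init))
        # <declarator-list> continues only after a comma.
        if pos < len(tokens) and tokens[pos] == ",":
            pos += 1
            continue
        break
    expect(pos == len(tokens) - 1 and tokens[pos] == ";", "';' at the end")
    return type_name, declarators


print(parse_var_dec(["int", "a", ",", "b", ";"]))
print(parse_var_dec(["boolean", "a", ",", "a", ";"]))
```

The second call succeeds even though `a` is declared twice: the grammar accepts it, which illustrates that "one variable, one type" has to be checked somewhere other than the syntax.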
• Programs stored in files are just sequences of characters, but we want to group those characters into tokens before further analysis.
• Tokens are the atoms of the program, e.g.
  – identifiers (const, x, fact)
  – keywords, i.e. reserved words (if, const)
  – operators (==)
  – constants (123.4), etc.
• How do we divide a program (a sequence of characters in a file) into a sequence of tokens?

Lexical Structure & Phrase Structure
• Grammars so far have defined phrase structure: how a program is built from a sequence of tokens
• We also need to use grammars to define lexical structure: how a text file is divided into tokens

Separate Grammars
• Usually there are two separate grammars
  – One to construct a sequence of tokens from a file of characters (lexical structure):
    <program-file> ::= <end-of-file> | <element> <program-file>
    <element> ::= <token> | <one-white-space> | <comment>
    <one-white-space> ::= <space> | <tab> | <end-of-line>
    <token> ::= <identifier> | <operator> | <constant> | …
  – One to construct a parse tree from a sequence of tokens (phrase structure)

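The lexical grammar above can be approximated with regular expressions. The following Python sketch uses the standard `re` module; the concrete patterns (a `//` line comment, decimal constants, a small operator set that also includes punctuation) are assumptions chosen for illustration, since the slide leaves `<identifier>`, `<operator>`, `<constant>`, and `<comment>` undefined.

```python
import re

# One named pattern per kind of <element>; earlier entries win on ties.
TOKEN_SPEC = [
    ("comment", r"//[^\n]*"),           # assumed comment syntax
    ("white_space", r"[ \t\n]+"),
    ("constant", r"\d+(?:\.\d+)?"),
    ("identifier", r"[A-Za-z_]\w*"),
    ("operator", r"==|[-+*/=<>;,()]"),  # punctuation lumped in for brevity
]
SCANNER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def scan(text):
    """Split `text` into (kind, lexeme) tokens, dropping white space and comments."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = SCANNER.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        if match.lastgroup not in ("comment", "white_space"):
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


print(scan("x == 123.4 // compare\n"))
```

Keywords are not separated from identifiers in this sketch; a real scanner would typically look each identifier up in a reserved-word table.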
Separate Compiler Passes
• Scanner: string → tokens
• Parser: tokens → parse tree
• (more to do afterwards)

Historical Note #1
• Early languages sometimes did not separate lexical structure from phrase structure
  – Early Fortran and Algol dialects allowed spaces anywhere, even in the middle of a keyword
  – Other languages like PL/I or early Fortran allow keywords to be used as identifiers
• This makes them difficult to scan and parse
• It also reduces readability

Historical Note #2
• Some languages have a fixed-format lexical structure: column positions are significant. Examples:
  – One statement per line (i.e. per card)
  – First few columns for statement label
  – Etc.
• Early dialects of Fortran, Cobol, and Basic
• Almost all modern languages are free-format: column positions are ignored (exception: Python)

Backus-Naur Form, BNF
• Some use → or = instead of ::=
• Some leave out the angle brackets and use a distinct typeface for tokens
• Some allow single quotes around tokens, for example to distinguish ‘|’ as a token from | as a meta-symbol
(Interesting operator!! Or not! Sir, please step away from the ASR-33.)

Backus-Naur Form, BNF

E → E+T | E-T | T
T → T*F | T/F | F
F → (E) | a | b | c

<exp> ::= <term> + <exp> | <term>
<term> ::= <fact> * <term> | <fact>
<fact> ::= ( <exp> ) | a | b | c

<stmt> ::= <if-stmt> | s1 | s2
<if-stmt> ::= if <expr> then <stmt> else <stmt> | if <expr> then <stmt>
<expr> ::= e1 | e2

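A layered grammar like this one translates directly into a recursive-descent parser, one function per non-terminal. The Python sketch below assumes the rules read `<exp> ::= <term> + <exp> | <term>`, `<term> ::= <fact> * <term> | <fact>`, and `<fact> ::= ( <exp> ) | a | b | c`, with single-character tokens. The function names are hypothetical, and the result is an abstract tree that drops the parentheses rather than a full parse tree.

```python
def parse_exp(tokens, pos=0):
    """<exp> ::= <term> + <exp> | <term>"""
    left, pos = parse_term(tokens, pos)
    if pos < len(tokens) and tokens[pos] == "+":
        right, pos = parse_exp(tokens, pos + 1)
        return ("+", left, right), pos
    return left, pos


def parse_term(tokens, pos):
    """<term> ::= <fact> * <term> | <fact>"""
    left, pos = parse_fact(tokens, pos)
    if pos < len(tokens) and tokens[pos] == "*":
        right, pos = parse_term(tokens, pos + 1)
        return ("*", left, right), pos
    return left, pos


def parse_fact(tokens, pos):
    """<fact> ::= ( <exp> ) | a | b | c"""
    if pos < len(tokens) and tokens[pos] == "(":
        inner, pos = parse_exp(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise ValueError(f"expected ')' at position {pos}")
        return inner, pos + 1
    if pos < len(tokens) and tokens[pos] in {"a", "b", "c"}:
        return tokens[pos], pos + 1
    raise ValueError(f"unexpected input at position {pos}")


def parse(text):
    """Parse a whole expression; spaces are ignored."""
    tokens = list(text.replace(" ", ""))
    tree, pos = parse_exp(tokens)
    if pos != len(tokens):
        raise ValueError(f"trailing input at position {pos}")
    return tree


print(parse("a+b*c"))
print(parse("((a+b)*c)"))
```

Because `*` can only appear inside a `<term>`, `a+b*c` has exactly one tree here, with the multiplication nested under the addition.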
Other Grammar Forms
• BNF variations
• EBNF variations
• Syntax diagrams

EBNF Variations
• Additional syntax to simplify some grammar chores:
  – {x} to mean zero or more repetitions of x
  – [x] to mean x is optional (i.e. x | <empty>)
  – () for grouping
  – | anywhere to mean a choice among alternatives
  – Quotes around tokens, if necessary, to distinguish them from meta-symbols

EBNF Examples
<if-stmt> ::= if <expr> then <stmt> [else <stmt>]
<stmt-list> ::= {<stmt> ;}
<thing-list> ::= { (<stmt> | <declaration>) ; }
• Anything that extends BNF this way is called an Extended BNF: EBNF
• There are many variations

Syntax Diagrams
• Syntax diagrams (“railroad diagrams”)
<if-stmt> ::= if <expr> then <stmt> else <stmt>
if-stmt: → if → expr → then → stmt → else → stmt →

Bypasses
<if-stmt> ::= if <expr> then <stmt> [else <stmt>]
if-stmt: → if → expr → then → stmt → (either else → stmt, or the bypass track) →

Branching
<exp> ::= <exp> + <exp> | <exp> * <exp> | ( <exp> ) | a | b | c

Loops
<exp> ::= <addend> {+ <addend>}

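The loop in the diagram corresponds to a loop in a parser. Below is a minimal Python sketch of `<exp> ::= <addend> {+ <addend>}`; the slide does not define `<addend>`, so the sketch assumes it is one of the tokens a, b, or c. `parse_sum` is a hypothetical name.

```python
def parse_sum(tokens):
    """<exp> ::= <addend> {+ <addend>}, with <addend> assumed to be a | b | c."""

    def addend(pos):
        if pos < len(tokens) and tokens[pos] in {"a", "b", "c"}:
            return tokens[pos], pos + 1
        raise ValueError(f"expected an addend at position {pos}")

    tree, pos = addend(0)
    # The EBNF braces become a while loop: zero or more repetitions.
    while pos < len(tokens) and tokens[pos] == "+":
        right, pos = addend(pos + 1)
        tree = ("+", tree, right)
    if pos != len(tokens):
        raise ValueError(f"unexpected token at position {pos}")
    return tree


print(parse_sum(list("a+b+c")))
```

Building the tree inside the loop groups the additions from the left, whereas a right-recursive BNF rule such as `<exp> ::= <term> + <exp>` groups them from the right.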
Syntax Diagrams, Pro and Con
• Easier for humans to read (follow)
• Difficult to perceive the phrase structure (syntax tree)?
• Harder for machines to read (for automatic parser-generators)

Conclusion
• We use grammars to define programming language syntax, both lexical structure and phrase structure
• Connection between theory and practice
  – Two grammars, two compiler passes
  – Parser-generators can produce code for those two passes automatically from grammars (compiler tools)