Principles of Programming Languages
www.cs.bgu.ac.il/~ppl192
Lesson 9: Abstract Syntax Trees

Concrete Syntax: Summary
Concrete syntax determines the surface structure of the programming language, including keywords, delimiters, parentheses and indentation.
Concrete syntax may be ambiguous; we use additional rules (precedence, associativity) to remove the ambiguity.
Abstract Syntax Trees (ASTs) represent programs in a way that is easy for other programs to consume.
Many concrete syntax variants can be mapped to the same abstract syntax; e.g., cond and if can be mapped to the same abstract syntax.

Programming component
The overall parsing process:
• A scanner implements the lexical rules, using regular expressions, and outputs a stream of tokens.
• A reader parses the stream of tokens into a stream of nested S-expressions (S-exps).
• A parser maps S-exps into ASTs.
The stages, their outputs and the rules that govern them:
scanner ===> stream of tokens (lexical rules)
parser ===> abstract syntax (syntactic rules)
interpreter ===> value (computation rules)

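As a rough illustration of the first two stages, the sketch below implements a tiny scanner and reader for parenthesized input. The names Sexp, scan, readFrom and read are assumptions made for this example; they are not the API of any library used in the course, and strings, quotes and comments are deliberately ignored.

```typescript
// Hypothetical names throughout (Sexp, scan, readFrom, read): illustration only.
type Sexp = string | Sexp[];

// Scanner: the lexical rules are a regular expression; the output is a list of tokens.
// A token is either a single parenthesis or a maximal run of other non-space characters.
const scan = (src: string): string[] =>
    src.match(/[()]|[^\s()]+/g) || [];

// Reader: turns the token list into a nested S-expression.
// Returns the S-expression that starts at pos and the index of the next unread token.
const readFrom = (tokens: string[], pos: number): [Sexp, number] => {
    if (pos >= tokens.length) throw new Error("Unexpected end of input");
    const tok = tokens[pos];
    if (tok === ")") throw new Error("Unexpected )");
    if (tok !== "(") return [tok, pos + 1];      // atomic token
    const items: Sexp[] = [];
    let i = pos + 1;
    while (i < tokens.length && tokens[i] !== ")") {
        const [item, next] = readFrom(tokens, i);
        items.push(item);
        i = next;
    }
    if (i >= tokens.length) throw new Error("Missing )");
    return [items, i + 1];                       // skip the closing ")"
};

const read = (src: string): Sexp => readFrom(scan(src), 0)[0];
```

For instance, read("(lambda (x) x)") yields the nested array ["lambda", ["x"], "x"], which is the kind of value a parser then maps to an AST.
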
Abstract Syntax
The abstract syntax of a language captures two types of relations between expressions:
1. An expression can be of different types.
2. An expression can be composed of sub-expressions, each fulfilling a specific role with respect to the parent expression.
For example, in the Scheme abstract syntax, an expression can be of the following types:
<expression> ==> <variable> | <procedure call> | <lambda expression> | <conditional> | <literal>
We call this relation a disjunction between different types of expressions.

Abstract Syntax
A specific expression type, such as a lambda-expression, contains sub-expressions:
lambda-expression:
    formals: List(var-expression)
    body: List(expression)
We call this relation a conjunction between different sub-expressions.
Note that we abstract away the keyword 'lambda' and the placement of parentheses. How did we use these details?

Disjoint Union Type
Abstract syntax defines a data type whose values represent all legal expressions in the programming language.
It defines the expression categories and the components of composite expressions.
For each component, the abstract syntax defines its role in the composite expression, its category, and its cardinality (how many instances of the component can appear in the parent expression). Specifically, we distinguish single-value components from list-value components.

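One possible way to picture conjunction, disjunction and cardinality is as TypeScript types. The fragment below is a simplified sketch: the tag names mirror the JSON examples in this lesson, but the selection of expression types and the helper directSubExps are assumptions made for illustration.

```typescript
// Illustrative fragment only: an assumed, simplified subset of the abstract syntax.
interface NumExp { tag: "NumExp"; val: number; }
interface VarRef { tag: "VarRef"; var: string; }
interface VarDecl { tag: "VarDecl"; var: string; }

// Conjunction: a composite type lists its components, each with a role (field name),
// a category (field type) and a cardinality (single value or list).
interface IfExp { tag: "IfExp"; test: CExp; then: CExp; alt: CExp; }     // three single-value components
interface ProcExp { tag: "ProcExp"; params: VarDecl[]; body: CExp[]; }  // two list-value components

// Disjunction: an expression is exactly one of the alternatives.
type CExp = NumExp | VarRef | IfExp | ProcExp;

// Hypothetical helper: the tag tells the alternatives apart,
// and TypeScript narrows the type of e in each branch.
const directSubExps = (e: CExp): CExp[] =>
    e.tag === "IfExp" ? [e.test, e.then, e.alt] :
    e.tag === "ProcExp" ? e.body :
    [];
```

Because every alternative carries a distinct tag, the union is disjoint: a value can never belong to two alternatives at once.
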
Scheme's Abstract Syntax
Each rule gives the concrete syntax on the left and the corresponding abstract syntax on the right:
<exp> ::= <define> | <cexp>                     / def-exp | cexp
<define> ::= ( define <var> <cexp> )            / def-exp(var: varDecl, val: cexp)
<var> ::= <identifier>                          / varRef(var: string)
<binding> ::= ( <var> <cexp> )                  / binding(var: varDecl, val: cexp)
<cexp> ::= <number>                             / num-exp(val: number)
        |  <boolean>                            / bool-exp(val: boolean)
        |  <string>                             / str-exp(val: string)
        |  <varRef>                             / varRef(var)
        |  ( lambda ( <varDecl>* ) <cexp>+ )    / proc-exp(params: List(varDecl), body: List(cexp))
        |  ( if <cexp> <cexp> <cexp> )          / if-exp(test: cexp, then: cexp, else: cexp)
        |  ( let ( binding* ) <cexp>+ )         / let-exp(bindings: List(binding), body: List(cexp))
        |  ( <cexp>* )                          / app-exp(operator: cexp, operands: List(cexp))
        |  ( quote <sexp> )                     / literal-exp(val: sexp)

Implementation
Examples of AST parsing
Write the AST of the following expression: (lambda (x) x)

Examples of AST parsing
We encode the tree as follows in a Scheme implementation of the AST data type:
'(proc-exp ((var-exp x)) ((var-exp x)))
In JSON:
{ tag: 'ProcExp',
  params: [ { tag: 'VarDecl', var: 'x' } ],
  body: [ { tag: 'VarRef', var: 'x' } ] }
We distinguish here between the two roles of variables: variable declarations (in the context of the procedure's formal parameters) and variable references.

Examples of AST parsing
Write the AST of the following expression: (if #t (+ 1 2) 'ok)

Examples of AST parsing
The expression (if #t (+ 1 2) 'ok) is encoded as the following Scheme value:
'(if-exp (bool-exp #t) (app-exp (var-exp +) ((num-exp 1) (num-exp 2))) (literal-exp ok))
And as the following JSON value in TypeScript:
{ tag: 'IfExp',
  test: { tag: 'BoolExp', val: true },
  then: { tag: 'AppExp',
          rator: { tag: 'PrimOp', op: '+' },
          rands: [ { tag: 'NumExp', val: 1 }, { tag: 'NumExp', val: 2 } ] },
  alt: { tag: 'LitExp', val: 'ok' } }

Parse Tree vs. AST
Compare the AST with the Parse Tree we practiced last week. These are different trees:
• The AST is a concrete value that will be manipulated by the interpreter.
• The Parse Tree is a proof that a concrete expression belongs to the language.

Implementing ASTs in TypeScript
ASTs describe types which correspond to disjoint union types. We implement them in TypeScript in the following manner:
For every composite type CT, define:
• A type definition interface CT { tag: "CT"; ... } with a field for each constituent
• A value constructor named makeCT
• A type predicate named isCT
For every disjoint union type DT, define:
• A type definition type DT = CT1 | CT2 | ...
• A type predicate named isDT
This 'recipe' provides a functional interface which encapsulates the data type definition.

Practice
Consider the following abstract syntax definition:
<E> ::= <number>      / num-exp(val: number)
     |  <E> + <E>     / add-exp(arg1: E, arg2: E)
     |  <E> * <E>     / mul-exp(arg1: E, arg2: E)
What language is defined? Is this language a subset of Scheme? How do we implement its AST?

Solution
1. It defines infix arithmetic expressions over numbers, + and *.
2. It is not a subset of Scheme: the operators are infix and there are no parentheses.
3. Implementation:
// Disjoint union type E
type E = NumExp | AddExp | MulExp;
const isE = (x: any): x is E => isNumExp(x) || isAddExp(x) || isMulExp(x);

// For each constituent type define an interface, a constructor and a type predicate.
interface NumExp { tag: "NumExp"; val: number; };
const makeNumExp = (n: number): NumExp => ({ tag: "NumExp", val: n });
const isNumExp = (x: any): x is NumExp => x.tag === "NumExp";

interface AddExp { tag: "AddExp"; left: E; right: E };
const makeAddExp = (left: E, right: E): AddExp => ({ tag: "AddExp", left: left, right: right });
const isAddExp = (x: any): x is AddExp => x.tag === "AddExp";

interface MulExp { tag: "MulExp"; left: E; right: E };
const makeMulExp = (left: E, right: E): MulExp => ({ tag: "MulExp", left: left, right: right });
const isMulExp = (x: any): x is MulExp => x.tag === "MulExp";

Parser as an AST factory
Given a stream of tokens, the parser returns an AST of the appropriate type.
In Scheme, it is convenient to split the work: stream of tokens ===> S-exp ===> AST (recall that Scheme expressions are also S-expressions).
In TypeScript, we use a library to parse S-expressions:
> npm install s-expression --save
This package performs two functions:
• Scanning
• Token stream to S-exp conversion

Example
The type of a parser is [Sexp -> Exp].
The structure of the parser follows the structure of Sexp:
• Atomic values that are acceptable as Exp
• List values, which represent compound Exp
The parser recognizes the shape of each composite expression in the grammar. Every branch of the logic ends with a constructor for one of the expression types (make-xxx), according to the AST type definition. This is the main property of a parser as a factory function for ASTs. The parameters of the constructors are recursive calls to the parser.

Example
// ============================
// Parsing utilities
const isEmpty = (x: any): boolean => x.length === 0;
const isArray = (x: any): boolean => x instanceof Array;
const isString = (x: any): boolean => typeof x === "string";
const isNumericString = (x: string): boolean => JSON.stringify(+x) === x;
const isError = (x: any): x is Error => x instanceof Error;

Example
// ============================
// Parsing
// Make sure to run "npm install ramda s-expression --save"
import parseSexp = require("s-expression");

// Parse a string: first read it into an S-expression, then map the S-expression to an AST.
const parseE = (x: string): E | Error =>
    parseESexp(parseSexp(x));

// Dispatch on the shape of the S-expression: a list is compound, a string is atomic.
const parseESexp = (sexp: any): E | Error =>
    isEmpty(sexp) ? Error("Unexpected empty") :
    isArray(sexp) ? parseECompound(sexp) :
    isString(sexp) ? parseEAtomic(sexp) :
    Error("Unexpected type " + sexp);

// A compound expression has exactly three parts: <E> <op> <E>.
const parseECompound = (sexps: any[]): E | Error =>
    sexps.length !== 3 ? Error("wrong length") :
    parseE3(sexps[1], parseESexp(sexps[0]), parseESexp(sexps[2]));

// Propagate errors from the sub-expressions, then build the node for the operator.
const parseE3 = (op: string, arg1: E | Error, arg2: E | Error): E | Error =>
    isError(arg1) ? arg1 :
    isError(arg2) ? arg2 :
    op === "+" ? makeAddExp(arg1, arg2) :
    op === "*" ? makeMulExp(arg1, arg2) :
    Error("Bad operator");

const parseEAtomic = (sexp: string): E | Error =>
    isNumericString(sexp) ? makeNumExp(+sexp) :
    Error("Bad token " + sexp);

Example
parseE("1");
{ tag: 'NumExp', val: 1 }

Example
parseE("(1 + 2)");
{ tag: 'AddExp',
  left: { tag: 'NumExp', val: 1 },
  right: { tag: 'NumExp', val: 2 } }

Example
parseE("(1 + (2 * 3))")
{ tag: 'AddExp',
  left: { tag: 'NumExp', val: 1 },
  right: { tag: 'MulExp',
           left: { tag: 'NumExp', val: 2 },
           right: { tag: 'NumExp', val: 3 } } }

Syntax ambiguity?
Can we parse "1 + 2 * 3"? Not with this parser: without parentheses the expression is ambiguous, since it can be read as (1 + 2) * 3 or as 1 + (2 * 3). It can be parsed with additional precedence rules, but this requires a more sophisticated parser. You'll learn more advanced methods in the Compilation course.

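To give a feel for what adding precedence rules involves, here is a minimal sketch of one common technique: recursive descent with one function per precedence level. It is not the parser of this lesson and not necessarily the method taught in the Compilation course; tokenize and parseInfix are hypothetical names, and the grammar in the comment is an assumption of the sketch. The AST definitions are repeated so the block stands on its own.

```typescript
type E = NumExp | AddExp | MulExp;
interface NumExp { tag: "NumExp"; val: number; }
interface AddExp { tag: "AddExp"; left: E; right: E; }
interface MulExp { tag: "MulExp"; left: E; right: E; }
const makeNumExp = (n: number): NumExp => ({ tag: "NumExp", val: n });
const makeAddExp = (left: E, right: E): AddExp => ({ tag: "AddExp", left, right });
const makeMulExp = (left: E, right: E): MulExp => ({ tag: "MulExp", left, right });

// Hypothetical helper: numbers are one token, any other non-space character is its own token.
const tokenize = (src: string): string[] => src.match(/\d+(\.\d+)?|\S/g) || [];

// Grammar assumed for this sketch ('*' binds tighter than '+', both left-associative):
//   sum     ::= product ('+' product)*
//   product ::= atom ('*' atom)*
//   atom    ::= number | '(' sum ')'
const parseInfix = (src: string): E => {
    const tokens = tokenize(src);
    let pos = 0;   // index of the next unread token
    const atom = (): E => {
        const tok = tokens[pos++];
        if (tok === "(") {
            const inner = sum();
            if (tokens[pos++] !== ")") throw new Error("Expected )");
            return inner;
        }
        if (tok === undefined || isNaN(Number(tok))) throw new Error("Bad token " + tok);
        return makeNumExp(Number(tok));
    };
    const product = (): E => {
        let left = atom();
        while (tokens[pos] === "*") { pos++; left = makeMulExp(left, atom()); }
        return left;
    };
    const sum = (): E => {
        let left = product();
        while (tokens[pos] === "+") { pos++; left = makeAddExp(left, product()); }
        return left;
    };
    const result = sum();
    if (pos < tokens.length) throw new Error("Unexpected token " + tokens[pos]);
    return result;
};
```

With this grammar, parseInfix("1 + 2 * 3") builds the same AddExp tree as the parenthesized input "(1 + (2 * 3))" in the earlier example, because product is parsed before sum gets to combine its operands.
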
Recipe for Processing ASTs
Implementing ASTs as disjoint union types allows for methodical processing of ASTs.
The make- constructors and is- type predicates define an interface that the interpreter can use.
For example, what does the following program do?
const max = (n1: number, n2: number): number =>
    (n1 > n2) ? n1 : n2;

// Compute the height of an E-AST
const Eheight = (e: E | Error): number =>
    isNumExp(e) ? 0 :
    isAddExp(e) ? max(Eheight(e.left), Eheight(e.right)) + 1 :
    isMulExp(e) ? max(Eheight(e.left), Eheight(e.right)) + 1 :
    0;

Consuming (typical) ASTs
This typical AST processor has the following structure:
• Its type is [AST -> something].
• The function's structure is a conditional expression that covers all expression types according to the AST disjoint union type definition.
• The program breaks the AST into its components using the AST accessors for the specific type of the branch. (For example, in the branch for isAddExp, we have the accessors AddExp.left and AddExp.right.)
• Usually, the function is called recursively on each of the components of compound AST values.
We obtain type-safe code in TypeScript when using this pattern: the type system of TypeScript infers that the parameter e is of type AddExp in the clause that follows the isAddExp guard, and similarly for isMulExp.

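The same structure carries over to other functions on E. As a further sketch, the hypothetical function evalE below (it does not appear in the slides) computes the numeric value of an E-AST; the type definitions are repeated from the Solution slide so the block is self-contained.

```typescript
type E = NumExp | AddExp | MulExp;
interface NumExp { tag: "NumExp"; val: number; }
interface AddExp { tag: "AddExp"; left: E; right: E; }
interface MulExp { tag: "MulExp"; left: E; right: E; }
const makeNumExp = (n: number): NumExp => ({ tag: "NumExp", val: n });
const makeAddExp = (left: E, right: E): AddExp => ({ tag: "AddExp", left, right });
const makeMulExp = (left: E, right: E): MulExp => ({ tag: "MulExp", left, right });
const isNumExp = (x: any): x is NumExp => x.tag === "NumExp";
const isAddExp = (x: any): x is AddExp => x.tag === "AddExp";

// Hypothetical consumer of type [E -> number]: one branch per alternative of E,
// with a recursive call on each component of the compound alternatives.
const evalE = (e: E): number =>
    isNumExp(e) ? e.val :
    isAddExp(e) ? evalE(e.left) + evalE(e.right) :
    evalE(e.left) * evalE(e.right);   // the only remaining alternative is MulExp

// The AST of (1 + (2 * 3)) from the parsing example
const sample: E = makeAddExp(makeNumExp(1), makeMulExp(makeNumExp(2), makeNumExp(3)));
const sampleValue = evalE(sample);
```

After the first two guards fail, TypeScript narrows e to MulExp in the last branch, which is why e.left and e.right type-check there without an explicit isMulExp test.
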
Scheme Abstract Syntax
Our interpreter adopts the following Scheme abstract syntax:
;; <program> ::= <exp>+                              / program(exps: List(exp))
;; <exp> ::= <define> | <cexp>                       / def-exp | cexp
;; <define> ::= ( define <varDecl> <cexp> )          / def-exp(var: varDecl, val: cexp)
;; <binding> ::= ( <varDecl> <cexp> )                / binding(var: varDecl, val: cexp)
;; <cexp> ::= <number>                               / num-exp(val: number)
;;         |  <boolean>                              / bool-exp(val: boolean)
;;         |  <string>                               / str-exp(val: string)
;;         |  <varRef>                               / varRef(var: string)
;;         |  ( lambda ( <varDecl>* ) <cexp>+ )      / proc-exp(params: List(varDecl), body: List(cexp))
;;         |  ( if <cexp> <cexp> <cexp> )            / if-exp(test: cexp, then: cexp, else: cexp)
;;         |  ( let ( binding* ) <cexp>+ )           / let-exp(bindings: List(binding), body: List(cexp))
;;         |  ( <cexp>* )                            / app-exp(operator: cexp, operands: List(cexp))
;;         |  ( quote <sexp> )                       / literal-exp(val: sexp)
The define category is defined so that define expressions cannot be embedded inside other expressions.
cexp corresponds to expressions that can be embedded into each other recursively. (cexp stands for constituent expression, that is, an expression which can occur as a component of a larger expression.)

Summary
ASTs are implemented using the pattern of disjoint union types.
• Each compound expression form is represented by a map-like type with a distinct tag and a field for each sub-expression.
• Unions of compound types correspond to abstract types such as "expression".
We saw a recipe for the TypeScript implementation:
• We implement union types using a type predicate function, such as isExp.
• We implement compound types with:
  1. a value constructor makeC
  2. a type predicate isC
  3. an interface type definition with a field for each sub-expression, accessed as c.subexp
Functions operating over ASTs follow the same recipe: their structure reflects the structure of the type, and they recurse on the sub-expressions (structural induction).

Parsing L1 Expressions
We present a first version of the TypeScript program, which encodes the following BNF as a set of disjoint union types:
<program> ::= (L1 <exp>+)                      // program(exps: List(exp))
<exp> ::= <define-exp> | <cexp>
<define-exp> ::= (define <var-decl> <cexp>)    // def-exp(var: var-decl, val: cexp)
<cexp> ::= <num-exp>                           // num-exp(val: Number)
        |  <bool-exp>                          // bool-exp(val: Boolean)
        |  <prim-op>                           // prim-op(op: String)
        |  <var-ref>                           // var-ref(var: String)
        |  (<cexp>*)                           // app-exp(rator: cexp, rands: List(cexp))
<prim-op> ::= + | - | * | / | < | > | = | not
<num-exp> ::= a number token
<bool-exp> ::= #t | #f
<var-ref> ::= an identifier token
<var-decl> ::= an identifier token

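Following the recipe from this lesson, the L1 grammar could be encoded roughly as below. This is only a sketch under assumptions: the tag names and field names are taken from the BNF annotations and the JSON examples above, only a few constructors and predicates are shown, and the course's actual code may differ.

```typescript
// Composite types: one interface per abstract syntax form (assumed names).
interface Program { tag: "Program"; exps: Exp[]; }
interface DefineExp { tag: "DefineExp"; var: VarDecl; val: CExp; }
interface NumExp { tag: "NumExp"; val: number; }
interface BoolExp { tag: "BoolExp"; val: boolean; }
interface PrimOp { tag: "PrimOp"; op: string; }
interface VarRef { tag: "VarRef"; var: string; }
interface VarDecl { tag: "VarDecl"; var: string; }
interface AppExp { tag: "AppExp"; rator: CExp; rands: CExp[]; }

// Disjoint union types: one per grammar rule with alternatives.
type CExp = NumExp | BoolExp | PrimOp | VarRef | AppExp;
type Exp = DefineExp | CExp;

// Value constructors
const makeProgram = (exps: Exp[]): Program => ({ tag: "Program", exps });
const makeDefineExp = (v: VarDecl, val: CExp): DefineExp => ({ tag: "DefineExp", var: v, val });
const makeNumExp = (n: number): NumExp => ({ tag: "NumExp", val: n });
const makePrimOp = (op: string): PrimOp => ({ tag: "PrimOp", op });
const makeVarRef = (v: string): VarRef => ({ tag: "VarRef", var: v });
const makeVarDecl = (v: string): VarDecl => ({ tag: "VarDecl", var: v });
const makeAppExp = (rator: CExp, rands: CExp[]): AppExp => ({ tag: "AppExp", rator, rands });

// Type predicates (two shown; the others follow the same shape)
const isDefineExp = (x: any): x is DefineExp => x.tag === "DefineExp";
const isAppExp = (x: any): x is AppExp => x.tag === "AppExp";

// Hand-built AST for the program (L1 (define x 5) (+ x 1))
const example: Program = makeProgram([
    makeDefineExp(makeVarDecl("x"), makeNumExp(5)),
    makeAppExp(makePrimOp("+"), [makeVarRef("x"), makeNumExp(1)])
]);
```

Note that DefineExp is a member of Exp but not of CExp, so the types themselves prevent a define from being nested inside an application.
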