Notes on Python Regular Expressions and parser generators

  • Slides: 10
Download presentation
Notes on Python Regular Expressions and parser generators (by D. Parson) • These are

Notes on Python Regular Expressions and parser generators (by D. Parson) • These are the Python supplements to the author’s slides for Chapter 1 and Section 2. 1. • http: //faculty. kutztown. edu/parson/spring 2014/CS C 310 Spring 2014. html has a link to the author’s slides, which are password protected by your K. U. Windows login / password used to access your student account.

Regular Expressions in Python • • re module in the optional Python text. http:

Regular Expressions in Python • • re module in the optional Python text. http: //docs. python. org/library/re. html A RE is a pattern in the form of a string. compile(pattern [, flags]) compiles an RE expression into a finite automaton object. • Return value can be used by other functions. • Flags are for case, multiline, and meta-character options. • search(pattern, string [, flags) searches string for the first match of pattern. • match(pattern, string [, flags) checks at string’s beginning. • Both return a Match. Object or None.

Regular Expressions in Python • split(pattern, string [, maxsplit = 0]) splits string into

Regular Expressions in Python • split(pattern, string [, maxsplit = 0]) splits string into occurrences of pattern. • Returns a list of strings • sub(pattern, repl, string [, count = 0]) performs substitutions of repl for pattern occurrences. • String and sequence operations are related. • http: //docs. python. org/library/string. html • >>> s = "abcde" • >>> dir(s) • ['__add__', '__class__', '__contains__', '__delattr__', '__doc__', '__eq__', '__format__', '__getattribute__', '__getitem__', '__getnewargs__', '__getslice__', '__gt__', '__hash__', '__init__', '__len__', '__lt__', '__mod__', '__mul__', '__new__', '__reduce_ex__', '__repr__', '__rmod__', '__rmul__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '_formatter_field_name_split', '_formatter_parser', 'capitalize', 'center', 'count', 'decode', 'endswith', 'expandtabs', 'find', 'format', 'index', 'isalnum', 'isalpha', 'isdigit', 'islower', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'partition', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'splitlines', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill']

Python Regular Expression Examples • >>> m 1 = search('a+z*(b. d)', 'abcdefghi') • >>>

Python Regular Expression Examples • >>> m 1 = search('a+z*(b. d)', 'abcdefghi') • >>> m 1 • <_sre. SRE_Match object at 0 x 11 c 520> • >>> m 1. groups() • ('bcd', ) • >>> m 1. start() • 0 • >>> m 1. end() • • • 4 >>> m 1. start(0) 0 >>> m 1. start(1) 1 # Group 0 is the entire match, 1 is the first parenthesized subexpression, etc.

Learn the major Meta-characters! • • • Text – verbatim text. – any character

Learn the major Meta-characters! • • • Text – verbatim text. – any character except newline ^ – matches start of the string (anchor) $ – matches end of the string * – Kleene start, 0 or more subpattern repetitions + – Kleene plus, 1 or more subpattern repetitions ? – optional, 0 or 1 subpattern occurrence | – alternation, either left or right subpattern () – group a subexpression inside parentheses – escape a meta-character (make it normal) [set of chars], [^set of chars not matched]

More Python RE Examples • >>> m 2 = search('a+z*(b. d)', 'Abcde') • >>>

More Python RE Examples • >>> m 2 = search('a+z*(b. d)', 'Abcde') • >>> m 2 • >>> print m 2 • None • • • >>> split(': ', "abc: cd: e: f") ['abc', 'cd', 'e', 'f'] >>> split('[: ]', "abc: cd: e: f") ['abc', 'cd', 'e', 'f'] >>> split('[^: ]', "abc: cd: e: f") ['', '', ': ', '']

More Python RE Examples (sub) • >>> sub('a([^b]+)b', 'A\1 B', 'a 123 b 45

More Python RE Examples (sub) • >>> sub('a([^b]+)b', 'A\1 B', 'a 123 b 45 ab 67 a 9 b aab') • 'A 123 B 45 ab 67 A 9 B Aa. B' • The parenthesized subexpression matches one or more occurrences of anything except for b. • The matched substring of the first parenthesized subexpression is group 1. • The replacement pattern 1 says “insert group 1 at this point. ” • Effect is to re-insert characters between a and b.

Finite State Automata • A regular expression compiler translates a regular expression into a

Finite State Automata • A regular expression compiler translates a regular expression into a finite state automaton. • This could be a linked data structure or code. It looks like a graph of mapping steps needed for the regular expression. • There are nondeterministic and deterministic flavors. • (a|b)c+d is a simple example expression. a s 1 start b c s 2 c c ε s 3 d s 4 accept

Lookahead 1 types of parsers. • LL(1) and LR(1) grammars require a parser to

Lookahead 1 types of parsers. • LL(1) and LR(1) grammars require a parser to get at most 1 look-ahead terminal from the scanner. • LL(1) cannot handle left-recursive grammar productions. It can handle other recursion. • LR(1) and its variants can handle left, right and nested recursion; left is the most efficient. • A generated parser is essentially a deterministic finite state automaton that uses a stack to keep track of nested syntactic structures. • This topic is covered exhaustively in compiler design.

Parser generators in Python. • • YAPPS 2 is an LL(1) parser generator. http:

Parser generators in Python. • • YAPPS 2 is an LL(1) parser generator. http: //theory. stanford. edu/~amitp/yapps/ http: //pypi. python. org/pypi/Yapps 2 PLY is a Python LALR(1) (subset of LR(1)) equivalent to UNIX YACC and GNU Bison that are used to generate compilers for C code. • http: //www. dabeaz. com/ply/ • Both generate Python executable parsers from stylized Python code.