Corpus Linguistics Practical utilities Lecture 7 Albert Gatt
Corpus search
- We have encountered the use of word-based and phrase-based searches.
- We now introduce some practical tools to find patterns:
  - regular expressions
  - the corpus query language (CQL):
    - developed by the Corpora and Lexicons Group, University of Stuttgart
    - a language for building complex queries using:
      - regular expressions
      - attributes and values
Regular expressions
- A regular expression is a pattern that matches some sequence in a text. It is a mixture of:
  - characters or strings of text
  - special characters
  - groups or ranges
- e.g. “match a string starting with the letter S and ending in ane”
Delimiting regexes
- Special characters for start and end:
  - /^man/ => any sequence which begins with “man”: man, manned, manning...
  - /man$/ => any sequence ending with “man”: doberman, policeman...
  - /^man$/ => any sequence consisting of “man” only
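As a rough illustration of the anchors above, here is a minimal sketch using Python's standard `re` module (the choice of Python and the word list are assumptions made for this example, not part of the lecture). `re.search` looks for the pattern anywhere in the string, so the anchors are what tie it to the start or end.

```python
import re

# Hypothetical word list, chosen only to illustrate the anchors.
words = ["man", "manned", "manning", "doberman", "policeman", "human", "command"]

# ^ ties the pattern to the start of the string.
starts_with_man = [w for w in words if re.search(r"^man", w)]
# $ ties the pattern to the end of the string.
ends_with_man = [w for w in words if re.search(r"man$", w)]
# Both anchors together: the whole string must be "man".
exactly_man = [w for w in words if re.search(r"^man$", w)]
# No anchors: "man" may occur anywhere, e.g. inside "command".
unanchored = [w for w in words if re.search(r"man", w)]

print(starts_with_man)  # ['man', 'manned', 'manning']
print(ends_with_man)    # ['man', 'doberman', 'policeman', 'human']
print(exactly_man)      # ['man']
print(unanchored)       # every word in the list
```

Without anchors the pattern also finds “man” in the middle of a word, which is usually not what a corpus search for word forms is after.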
Groups of characters and choices
- /[wh]ood/
  - matches wood or hood
  - […] signifies a choice of characters
- /[^wh]ood/
  - matches mood, food, but not wood or hood
  - /[^…]/ signifies any character but what’s in the brackets
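The two bracket patterns can be tried out with a short Python sketch (the word list is invented for illustration). `re.fullmatch` is used so that the whole word has to fit the pattern, as if it were wrapped in `^…$`.

```python
import re

# Hypothetical test words; "ood" is included to show that a bracket
# expression always consumes exactly one character.
words = ["wood", "hood", "mood", "food", "good", "ood"]

# [wh] picks exactly one character from the set {w, h}.
choice = [w for w in words if re.fullmatch(r"[wh]ood", w)]
# [^wh] picks exactly one character that is NOT w or h.
negated = [w for w in words if re.fullmatch(r"[^wh]ood", w)]

print(choice)   # ['wood', 'hood']
print(negated)  # ['mood', 'food', 'good']
```

Note that “ood” matches neither pattern: a negated class still requires some character to be present in that position.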
Ranges
- Some sets of characters can be expressed as ranges:
- /[a-z]/
  - any alphabetic, lower-case character
- /[0-9]/
  - any digit between 0 and 9
- /[a-zA-Z]/
  - any alphabetic, upper- or lower-case character
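A small Python sketch of the three ranges, under the assumption that each token is a single character (the token list is made up for this example). It also shows a practical caveat: these ranges cover only the unaccented letters a to z, so a letter such as Maltese “ħ” falls outside them.

```python
import re

# Hypothetical single-character tokens, including a non-ASCII letter.
tokens = ["a", "q", "Z", "7", "ħ", "-"]

lowercase = [t for t in tokens if re.fullmatch(r"[a-z]", t)]
digits = [t for t in tokens if re.fullmatch(r"[0-9]", t)]
letters = [t for t in tokens if re.fullmatch(r"[a-zA-Z]", t)]

print(lowercase)  # ['a', 'q']
print(digits)     # ['7']
print(letters)    # ['a', 'q', 'Z']  ("ħ" is not in a-z or A-Z)
```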
Disjunction and wildcards
- /ba./
  - matches bat, bad, …
  - /./ means “any single character” (not only alphanumeric ones: ba! matches too)
- /gupp(y|ies)/
  - guppy OR guppies
  - /(x|y)/ means “either x or y”
  - important to use parentheses! Without them, /guppy|ies/ means “guppy OR ies”
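The wildcard and the role of the parentheses can be checked with the following Python sketch (test strings are invented for illustration; in Python's `re`, `.` matches any character except a newline by default).

```python
import re

# The wildcard: "ba" followed by exactly one further character.
candidates = ["bat", "bad", "ba!", "ba", "bath"]
wildcard = [w for w in candidates if re.fullmatch(r"ba.", w)]
print(wildcard)  # ['bat', 'bad', 'ba!']

# Disjunction with and without grouping parentheses.
forms = ["guppy", "guppies", "ies", "gupp"]
grouped = [w for w in forms if re.fullmatch(r"gupp(y|ies)", w)]
ungrouped = [w for w in forms if re.fullmatch(r"guppy|ies", w)]
print(grouped)    # ['guppy', 'guppies']
print(ungrouped)  # ['guppy', 'ies']
```

The ungrouped pattern splits the whole expression at `|`, so it accepts the bare string “ies” and rejects “guppies”.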
Quantifiers (I)
- /colou?r/
  - matches color or colour
- /govern(ment)?/
  - matches govern or government
- /?/ means “zero or one of the preceding character or group”
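A minimal Python sketch of the optional quantifier, applied once to a single character and once to a parenthesised group (the test strings are assumptions for this example):

```python
import re

# ? after a single character: the "u" is optional.
spellings = ["color", "colour", "colouur", "colr"]
optional_u = [w for w in spellings if re.fullmatch(r"colou?r", w)]
print(optional_u)  # ['color', 'colour']

# ? after a group: the whole suffix "ment" is optional.
forms = ["govern", "government", "governmentment", "governs"]
optional_ment = [w for w in forms if re.fullmatch(r"govern(ment)?", w)]
print(optional_ment)  # ['govern', 'government']
```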
Quantifiers (II)
- /ba+/
  - matches ba, baaa…
- /(inkiss )+/
  - matches inkiss, inkiss inkiss, …
  - (note the whitespace in the regex)
- /+/ means “one or more of the preceding character or group”
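The “one or more” quantifier can be sketched in Python as follows (test strings are invented for illustration). Because the space is inside the group, every repetition, including the last one, must end in a space.

```python
import re

# + after a single character: one or more "a" after the "b".
strings = ["b", "ba", "baaa", "baba"]
one_or_more_a = [s for s in strings if re.fullmatch(r"ba+", s)]
print(one_or_more_a)  # ['ba', 'baaa']

# + after a group: the unit "inkiss " (with trailing space) repeats.
phrases = ["inkiss ", "inkiss inkiss ", "inkiss", ""]
repeated = [p for p in phrases if re.fullmatch(r"(inkiss )+", p)]
print(repeated)  # ['inkiss ', 'inkiss inkiss ']
```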
Quantifiers (III)
- /ba*/
  - matches b, ba, baa, baaa
  - /*/ means “zero or more of the preceding character or group”
- /(ba ){1,3}/
  - matches ba, ba ba or ba ba ba
  - {n,m} means “between n and m”
- /(ba ){2}/
  - matches ba ba
  - {n} means “exactly n”
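As a sketch of the remaining quantifiers in Python (the test data are generated for this example only): the counted forms are checked against strings made of k copies of “ba ”, so the result lists show which repetition counts each pattern accepts.

```python
import re

# * after a single character: zero or more "a" after the "b".
strings = ["b", "ba", "baaa", "a"]
zero_or_more_a = [s for s in strings if re.fullmatch(r"ba*", s)]
print(zero_or_more_a)  # ['b', 'ba', 'baaa']

# Build "ba " repeated k times for k = 0..4 and record which k match.
counts = range(5)
one_to_three = [k for k in counts if re.fullmatch(r"(ba ){1,3}", "ba " * k)]
exactly_two = [k for k in counts if re.fullmatch(r"(ba ){2}", "ba " * k)]
print(one_to_three)  # [1, 2, 3]
print(exactly_two)   # [2]
```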