Course 2: Regular expressions and the AntConc concordance tool
University of Bucharest, Digital Humanities Master, February 2020
Anca Dinu

Regular expressions
In corpus linguistics, much of what we want to do with a tool is pattern-matching over texts/corpora, like:
• Find all words that begin with k and end with a vowel.
• Find all words that have a sequence of three vowels.
• Find all three-syllable words.
• Find all adjectives ending in -ic.
• Find all plural nouns preceded by "the" in questions.
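A few of these searches can be sketched with Python's re module. This is only an illustration: the sample sentence is invented, and the last pattern finds words ending in -ic without checking that they are adjectives, which would need part-of-speech annotation.

```python
import re

# Invented sample text, used only for illustration.
text = "The kiwi and the koala saw a queue of heroic, comic and basic kangaroos."

# Words that begin with k and end with a vowel.
k_vowel = re.findall(r"\bk\w*[aeiou]\b", text, flags=re.IGNORECASE)

# Words that contain a sequence of three vowels.
three_vowels = re.findall(r"\b\w*[aeiou]{3}\w*\b", text, flags=re.IGNORECASE)

# Words ending in -ic (adjective candidates only; no part-of-speech check).
ic_words = re.findall(r"\b\w+ic\b", text, flags=re.IGNORECASE)

print(k_vowel)       # ['kiwi', 'koala']
print(three_vowels)  # ['queue']
print(ic_words)      # ['heroic', 'comic', 'basic']
```

Tasks such as "three-syllable words" or "plural nouns in questions" depend on linguistic information that is not in the spelling alone, so they usually require an annotated corpus.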

Regular expressions
• Regexes are expressions that represent string patterns.
• They are extremely useful for extracting information from any text by searching for matches of a specific search pattern (i.e. a specific sequence of ASCII or Unicode characters).
• Regular expression searches are among the most popular and powerful search tools, and they are available in most text-processing software.
• They describe the regular languages of Chomsky's Hierarchy and were formalized by the mathematician Stephen Kleene (hence the Kleene star, *).

Chomsky Hierarchy
From most to least strict, the four formal grammars in CH are:
• Regular grammars, which retain no past state knowledge from input string to output string.
• Context-free grammars, which retain only recent state knowledge from input string to output string.
• Context-sensitive grammars, which keep all past state knowledge from input string to output string.
• Unrestricted (or recursively enumerable) grammars, which have all state knowledge and thus can create every output string imaginable from a given input string.

Chomsky Hierarchy

Regular Grammar - linguistic flavour
A regular grammar is a mathematical object, G, with four components, G = (N, Σ, P, S), where
• N is a nonempty, finite set of nonterminal symbols,
• Σ is a finite set of terminal symbols (the alphabet),
• P is a set of grammar rules, each having one of the forms
– A → aB
– A → a
– A → ε,
for A, B ∈ N, a ∈ Σ, and ε the empty string, and
• S ∈ N is the start symbol.

The Language Generated by a Regular Grammar
Let G be a regular grammar. The language generated by the regular grammar G = (N, Σ, P, S) is
L(G) = {w ∈ Σ* | S ⇒* w}
• Translation: the language of a regular grammar is the set of all strings over the alphabet Σ that can be derived from the start symbol S by application of the grammar rules.
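The definition of L(G) can be made concrete with a small sketch. The grammar below (S → aS | bA, A → bA | ε) is a hypothetical example, not one from the course, and the encoding of rules as (terminal, next nonterminal) pairs is just one convenient choice. The function enumerates every string of bounded length that can be derived from the start symbol.

```python
from collections import deque

# Hypothetical right-linear grammar: S -> aS | bA, A -> bA | ε.
# Each rule is (terminal, next nonterminal); ("", None) encodes A -> ε.
RULES = {
    "S": [("a", "S"), ("b", "A")],
    "A": [("b", "A"), ("", None)],
}

def generate(rules, start, max_len):
    """Return all strings of length <= max_len derivable from `start`."""
    results = set()
    queue = deque([("", start)])  # (terminals produced so far, pending nonterminal)
    while queue:
        prefix, nonterminal = queue.popleft()
        for terminal, next_nt in rules[nonterminal]:
            word = prefix + terminal
            if len(word) > max_len:
                continue
            if next_nt is None:
                results.add(word)  # derivation finished: word is in L(G)
            else:
                queue.append((word, next_nt))
    return sorted(results, key=lambda w: (len(w), w))

language = generate(RULES, "S", 3)
print(language)  # ['b', 'ab', 'bb', 'aab', 'abb', 'bbb']
```

The same language is described by the regex a*b+, which illustrates the correspondence between regular grammars and regular expressions.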

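Regular grammar-regex equivalence
• A formal grammar (like a regular grammar) generates and recognizes a language.
• Regexes do the same.
• Regular grammars are equivalent to regexes in the formal sense: both describe exactly the regular languages. The equivalence is only approximate in practice, because the regex dialects of real tools add extensions beyond the formal definition.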

Regexes - CS flavour
A regular expression is defined recursively as follows:
• The empty set is a regular expression.
• The empty string is a regular expression.
• For any character x in the input alphabet, x is a regular expression that produces the regular language {x}.
• Plus the following 3 operations:

Regexes - CS flavour
• Alternation: If x and y are regular expressions, then x | y is a regular expression. For example, the regular expression a|b produces the regular language {a, b}.
• Concatenation: If x and y are regular expressions, then x • y is a regular expression. For example, the regular expression a • b produces the regular language {ab}.
• Repetition (Kleene star): If x is a regular expression, then x* is a regular expression. For example, the regular expression a • b* produces the regular language {a, ab, abb, abbb, ...}.
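The three operations can be checked with Python's re module, where concatenation is written by simple juxtaposition (ab rather than a • b). The helper name matches and the candidate strings are chosen here for illustration only.

```python
import re

def matches(pattern, strings):
    """Return the strings that the pattern matches in full."""
    return [s for s in strings if re.fullmatch(pattern, s)]

candidates = ["", "a", "b", "ab", "abb", "abbb", "ba"]

alternation = matches(r"a|b", candidates)    # a or b
concatenation = matches(r"ab", candidates)   # a followed by b
kleene_star = matches(r"ab*", candidates)    # a followed by zero or more b

print(alternation)    # ['a', 'b']
print(concatenation)  # ['ab']
print(kleene_star)    # ['a', 'ab', 'abb', 'abbb']
```

re.fullmatch is used so that a string counts only if the whole string belongs to the language, which mirrors the set notation in the definitions.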

Regexes - CS flavour
• There are some other operators derived from combinations of the three original operations on regexes (alternation, concatenation, repetition):
• +, ?, {n}, etc. (see the regular expression cheat sheet of Michael Yoshitaka Erlewine)
• parentheses, when used as capturing groups with backreferences, add extra power w.r.t. Regular Grammars
• Special characters need to be escaped (preceded by \) to be interpreted literally.
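A short sketch of why escaping matters, using Python's re module and an invented sample string. re.escape is a standard-library helper that adds the backslashes automatically.

```python
import re

# Invented sample text.
text = "Price: 3.50 (approx.) vs 3x50"

# Unescaped, the dot is a wildcard and also matches the "x" in "3x50".
unescaped = re.findall(r"3.50", text)

# Escaped with a backslash, the dot matches only a literal ".".
escaped = re.findall(r"3\.50", text)

# re.escape escapes every special character in a literal search string.
literal = re.findall(re.escape("(approx.)"), text)

print(unescaped)  # ['3.50', '3x50']
print(escaped)    # ['3.50']
print(literal)    # ['(approx.)']
```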

Summary
OR: A vertical bar separates alternatives. For example, gray|grey can match "gray" or "grey".
Grouping: Parentheses are used to define the scope and precedence of the operators (among other uses). For example, gray|grey and gr(a|e)y are equivalent patterns which both describe the set of "gray" or "grey".
Quantification (after a token)
• ? indicates zero or one occurrences of the preceding element. For example, colou?r matches both "color" and "colour".
• * indicates zero or more occurrences of the preceding element. For example, ab*c matches "ac", "abc", "abbc", "abbbc", ...
• + indicates one or more occurrences of the preceding element. For example, ab+c matches "abc", "abbc", "abbbc", ..., but not "ac".

Summary
• {n} The preceding item is matched exactly n times.
• {min,} The preceding item is matched min or more times.
• {min,max} The preceding item is matched at least min times, but not more than max times.
Wildcard: . matches any character. For example, a.b matches any string that contains an "a", then any other character and then a "b"; a.*b matches any string that contains an "a" and a "b" at some later point.
Take a look at examples: https://medium.com/factory-mind/regex-tutorial-a-simple-cheatsheet-by-examples-649dc1c3f285
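The quantifiers and the wildcard from the summary can be tried out in Python. The helper full and the test strings are illustrative choices, and full-string matching is used for the quantifier examples so that each result lists exactly the strings the pattern describes.

```python
import re

def full(pattern, strings):
    """Return the strings that the pattern matches in full."""
    return [s for s in strings if re.fullmatch(pattern, s)]

optional = full(r"colou?r", ["color", "colour", "colouur"])
star = full(r"ab*c", ["ac", "abc", "abbc"])
plus = full(r"ab+c", ["ac", "abc", "abbbc"])
exact = full(r"a{2}", ["a", "aa", "aaa"])
at_least = full(r"a{2,}", ["a", "aa", "aaa"])
between = full(r"a{1,2}", ["a", "aa", "aaa"])

# The wildcard: a, then any characters, then b, found inside a longer string.
wildcard = re.search(r"a.*b", "xxaYYbzz").group()

print(optional)  # ['color', 'colour']
print(star)      # ['ac', 'abc', 'abbc']
print(plus)      # ['abc', 'abbbc']
print(exact)     # ['aa']
print(at_least)  # ['aa', 'aaa']
print(between)   # ['a', 'aa']
print(wildcard)  # aYYb
```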

AntConc
• AntConc is a general-purpose tool for analysing corpora.
• It is free and easy to download and use.
• It can be used for virtually any language.
• It supports plain and annotated corpora.
• Made by Laurence Anthony: http://www.laurenceanthony.net/software/antconc/

Basics
• Loading corpus files
• Viewing files
• Word list
• Concordance tool
• Tool preferences
• Global settings

Word list
• A word list is a list of the words in a corpus, ordered by their frequency;
• Sort by: the frequency (default), by the word (alphabetically), by the word end, by inverse order;
• The word list can be saved by AntConc as a text file;
• Tool preferences for Word list:
– Lemma list: a list with the inflections of words, for instance for be: is, are, been, was, were, etc. It returns the list of head words, accompanied by their family words (inflections) and their frequency.
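What a word list computes can be approximated in a few lines of Python. This is a sketch of the idea, not AntConc's implementation: the sample corpus is invented and the tokenizer (lowercase runs of a-z) is a simplifying assumption.

```python
import re
from collections import Counter

# Invented sample corpus.
corpus = "The cat sat on the mat. The dog saw the cat."

# Simplified tokenization: lowercase alphabetic runs only.
tokens = re.findall(r"[a-z]+", corpus.lower())
frequencies = Counter(tokens)

# Sort by frequency (ties broken alphabetically), by word, and by word end.
by_frequency = sorted(frequencies.items(), key=lambda item: (-item[1], item[0]))
by_word = sorted(frequencies.items())
by_word_end = sorted(frequencies.items(), key=lambda item: item[0][::-1])

print(by_frequency[:2])               # [('the', 4), ('cat', 2)]
print([w for w, _ in by_word])        # ['cat', 'dog', 'mat', 'on', 'sat', 'saw', 'the']
print([w for w, _ in by_word_end])    # ['the', 'dog', 'on', 'cat', 'mat', 'sat', 'saw']
```

Sorting by word end compares the reversed spelling of each word, which groups words that share a suffix.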

Word list
• Word list range: use specific words (only the words the user is interested in) or use a stop list (exclude the words in the stop list). The choice between those two options depends on the goal of the analysis:
– if the user studies the stylome of an author or works on authorship identification, s/he could look only at function words, like prepositions or pronouns, because they are harder to disguise;
– if the user performs a semantic study, s/he might want to exclude the function (stop) words.
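The two range options can be sketched as two filters over the same token list. The sentence and the tiny function-word list below are hypothetical; a real study would load a full list from a file.

```python
import re
from collections import Counter

# Invented sample sentence.
corpus = "She said that the report on the budget was in the folder."
tokens = re.findall(r"[a-z]+", corpus.lower())

# Hypothetical, deliberately tiny function-word list.
FUNCTION_WORDS = {"she", "that", "the", "on", "was", "in"}

# "Use specific words": keep only the function words (stylometric view).
function_counts = Counter(t for t in tokens if t in FUNCTION_WORDS)

# "Use a stop list": exclude the function words (semantic view).
content_counts = Counter(t for t in tokens if t not in FUNCTION_WORDS)

print(function_counts["the"])   # 3
print(sorted(content_counts))   # ['budget', 'folder', 'report', 'said']
```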

Concordance tool
• Search for words and patterns
• Sort by left and right context
– Ex: report, reported, reporting, report on, to report
• Search with wildcards:
– Ex: report* (all wildcards are listed in Global Settings)
• Editing tricks: click on the highlighted words, using shift, alt, ctrl
• Search options: whole-word search switched off (substring search, e.g. rep, por), lower/upper case sensitivity

Concordance tool
• Search for regular expressions:
– \br[a-z]+?t\b
– \bcat\w.
– (cat|dog)
– [aeiou][aeiou]
– \d
– \b(\w+)er\b
• Advanced search: load multiple search words or a search list from a file and search for a context word in a window (said)
• ‘Clone results’ for comparing 2 or more results
• Exporting the results: Tool preferences (adding a delimiter to the hit word, such as a tab, for copy-pasting into an Excel spreadsheet).
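The idea behind a concordance (keyword in context) search can be sketched in Python. This is an illustration of the principle, not AntConc's implementation: the function name concordance, the width parameter, and the sample text are all assumptions made for the example.

```python
import re

def concordance(text, pattern, width=20):
    """Return (left context, hit, right context) for every regex match."""
    lines = []
    for match in re.finditer(pattern, text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        lines.append((left, match.group(), right))
    return lines

# Invented sample text.
text = "They report on Monday. She reported it, and he is reporting now."

# \breport\w* plays the role of the wildcard search report*.
hits = concordance(text, r"\breport\w*", width=10)

# Tab-delimited output, convenient for pasting into a spreadsheet.
for left, hit, right in hits:
    print(f"{left:>10}\t{hit}\t{right}")
```

Sorting the resulting tuples by their left or right context would reproduce the sort options described for the concordance tool.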