Globalisation & Computer systems
Week 8

- Finish Text processes part 1
  - Searching strings and regular expressions
  - Practical: regular expressions in UNIX
- Text processes part 2
  - Spell checkers

Searching

Text elements
- The objects of a text
- What counts as an object depends on perspective
- Different text processes operate over different objects

Regular Expressions

- Basis of all web-based and word-processor-based searches
- Definition 1: an algebraic notation for describing a string
- Definition 2: a set of rules that you can use to specify one or more items, such as words in a file, by using a single character string (Sarwar et al.)

Regular Expressions

- A search needs two things: a regular expression and a text corpus
- Regular expression algebra has variants:
  - Perl
  - Unix tools: egrep, sed, awk

Regular Expressions

- Find occurrences of /Nokia/ in the text

  egrep -n 'Nokia' nokia_corpus.txt
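The same numbered-line search can be sketched in Python with the standard `re` module. This is only an illustration: the corpus lines are invented stand-ins for nokia_corpus.txt, and `grep_n` is a hypothetical helper name, not a library function.

```python
import re

# Invented sample lines standing in for nokia_corpus.txt (an assumption).
CORPUS = [
    "Nokia said profits fell in the third quarter.",
    "Analysts had expected a stronger result.",
    "Shares in Nokia dropped after the warning.",
]

def grep_n(pattern, lines):
    """Return (line_number, line) pairs for lines containing a match, like egrep -n."""
    regex = re.compile(pattern)
    return [(i, line) for i, line in enumerate(lines, start=1) if regex.search(line)]

# Print each hit in the "number:line" layout that egrep -n uses.
for number, line in grep_n("Nokia", CORPUS):
    print(f"{number}:{line}")
```

Like egrep, the sketch reports whole lines that contain a match, numbered from 1.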

Regular Expressions

- Suppress case distinctions: match Nokia or nokia
- Set operator: [ ]

  egrep -n '[Nn]okia' nokia_corpus.txt

Regular Expressions

- Suppress other features, for example singular share or plural shares
- Optional operator: ?

  egrep -n 'shares?' nokia_corpus.txt
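A minimal Python sketch of the set and optional operators, assuming an invented sample sentence rather than the real corpus:

```python
import re

# Invented sample text (an assumption, not taken from nokia_corpus.txt).
text = "Nokia shares fell. Each nokia share lost value."

# [Nn] is a set: it matches exactly one character, either N or n.
set_hits = re.findall(r"[Nn]okia", text)
print(set_hits)        # ['Nokia', 'nokia']

# s? makes the preceding s optional, so singular and plural both match.
optional_hits = re.findall(r"shares?", text)
print(optional_hits)   # ['shares', 'share']
```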

Regular Expressions

- Kleene operators:
  - /string*/ “zero or more occurrences of the previous character”
  - /string+/ “one or more occurrences of the previous character”

Regular Expressions

- Wildcard operator:
  - /string./ “any single character after the previous character”
- Combine wildcard and Kleene:
  - /string.*/ “zero or more instances of any character after the previous character”
  - /string.+/ “one or more instances of any character after the previous character”

Regular Expressions

  egrep -n 'profit.*' nokia_corpus.txt
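The Kleene and wildcard operators can be compared side by side in a short Python sketch. The test strings are invented, and `matches` is a hypothetical helper that requires the whole string to match, so that the effect of each operator is visible.

```python
import re

def matches(pattern, text):
    """True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

# * : zero or more of the previous character (here the final g).
star = [s for s in ["strin", "string", "stringgg"] if matches(r"string*", s)]
print(star)       # ['strin', 'string', 'stringgg']

# + : one or more of the previous character.
plus = [s for s in ["strin", "string", "stringgg"] if matches(r"string+", s)]
print(plus)       # ['string', 'stringgg']

# . : exactly one arbitrary character after "string".
dot = [s for s in ["string", "strings", "stringed"] if matches(r"string.", s)]
print(dot)        # ['strings']

# .* : any run of characters, including none, after "profit".
dot_star = [s for s in ["profit", "profits", "profitable"] if matches(r"profit.*", s)]
print(dot_star)   # ['profit', 'profits', 'profitable']
```

Because egrep prints every line that contains a match, 'profit.*' selects the same lines as 'profit'; the trailing .* only changes how much of each line counts as the match.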

Regular Expressions

Anchors
- Beginning-of-line operator: ^

  egrep '^said' nokia_corpus.txt

- End-of-line operator: $ (placed after the text it anchors)

  egrep 'said$' nokia_corpus.txt
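A small Python sketch of the two anchors, using invented sample lines. Note that ^ comes before the text it anchors and $ comes after it.

```python
import re

# Invented sample lines (an assumption, not taken from the corpus).
lines = ["said the chairman", "the chairman said", "he said it twice"]

# ^said matches only when the line begins with "said".
starts = [line for line in lines if re.search(r"^said", line)]
print(starts)   # ['said the chairman']

# said$ matches only when the line ends with "said".
ends = [line for line in lines if re.search(r"said$", line)]
print(ends)     # ['the chairman said']
```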

Regular Expressions

Disjunction:
- Set operator: /[Ss]tring/ “a string which begins with either S or s”
- Range: /[A-Z]tring/ “a string beginning with a capital letter”
- Pipe |: /string1|string2/ “either string1 or string2”

Regular Expressions

- Disjunction

  egrep -n 'weak|warning|drop' nokia_corpus.txt
  egrep -n 'weak.*|warn.*|drop.*' nokia_corpus.txt
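The pipe and range forms of disjunction can be tried in Python as follows. The sample lines are invented for illustration and are not the contents of nokia_corpus.txt.

```python
import re

# Invented sample lines (an assumption).
lines = [
    "Nokia issued a profit warning.",
    "Demand was weaker than expected.",
    "Sales rose in Asia.",
    "Shares dropped sharply.",
]

# Pipe: a line is reported if it contains any one of the alternatives.
pattern = re.compile(r"weak|warning|drop")
hits = [i for i, line in enumerate(lines, start=1) if pattern.search(line)]
print(hits)   # [1, 2, 4]

# Range: [A-Z] matches exactly one capital letter.
caps = re.findall(r"[A-Z]tring", "String string Xtring")
print(caps)   # ['String', 'Xtring']
```

The alternatives match inside longer words too, so "weak" finds "weaker" and "drop" finds "dropped".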

Regular Expressions

- Negation: /[^a-z]tring/ “any string that does not begin with a lowercase letter”
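A short Python sketch of the negated set, with invented test words. It shows that [^a-z] still consumes one character, which may be a capital letter, a digit or punctuation.

```python
import re

# Invented test words (an assumption).
words = ["String", "string", "5tring", "-tring", "tring"]

# [^a-z] matches any single character that is not a lowercase letter.
not_lower = [w for w in words if re.fullmatch(r"[^a-z]tring", w)]
print(not_lower)   # ['String', '5tring', '-tring']
```

The bare word "tring" is rejected because the negated set must match some character; it does not mean "nothing here".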

Regular Expressions

Precedence (highest to lowest)
1. Parentheses ( )
2. Kleene and optional operators * + ?
3. Anchors and sequences
4. Disjunction operator |

(a) /supply|iers/ matches /supply/ or /iers/
(b) /suppl(y|iers)/ matches /supply/ or /suppliers/
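The effect of precedence on examples (a) and (b) can be checked with a few lines of Python. Whole-string matching is used here so that the difference is easy to see.

```python
import re

words = ["supply", "suppliers", "iers"]

# (a) Sequence binds tighter than |, so the alternatives are "supply" and "iers".
without_parens = [w for w in words if re.fullmatch(r"supply|iers", w)]
print(without_parens)   # ['supply', 'iers']

# (b) Parentheses limit the disjunction to the endings "y" and "iers".
with_parens = [w for w in words if re.fullmatch(r"suppl(y|iers)", w)]
print(with_parens)      # ['supply', 'suppliers']
```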

Spelling dictionaries

Aim: given a sequence of symbols,
1. identify misspelled strings
2. generate a list of possible ‘candidate’ correct strings
3. select the most probable candidate from the list

Spelling dictionaries

Implementation: probabilistic framework
- Bayesian rule
- noisy channel model

Spelling dictionaries

Types of spelling error
- Actual-word errors
  - /piece/ instead of /peace/
  - /there/ instead of /their/
- Non-word errors
  - /graffe/ instead of /giraffe/
- Of all errors in typewritten texts, 80% are non-word errors

Spelling dictionaries

Non-word errors
- Cognitive errors
  - /seperate/ instead of /separate/
  - a phonetically equivalent sequence of symbols has been substituted
  - due to lack of knowledge about spelling conventions
- Typographic (‘typo’) errors
  - influenced by the keyboard
  - e.g. substitution of /w/ for /e/ due to its adjacency on the keyboard
  - /thw/ instead of /the/

Spelling dictionaries

Non-word errors: noisy channel model
- The actual word has been passed through a noisy communication channel
- This has distorted the word, thereby changing it in some way
- The misspelled word is the distorted version of the actual word
- Aim: recover the actual word by hypothesising about the possible ways in which it could have been distorted

Spelling dictionaries

Non-word errors: noisy channel model
What are the possible distortions?
- insertion
- deletion
- substitution
- transposition
All of these are viewed as transformations that take place in the noisy channel
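The four distortions can be written as small string operations. This is a sketch only: the function names and the position-based signatures are assumptions made for illustration.

```python
def insert(word, i, ch):
    """Insert ch before position i."""
    return word[:i] + ch + word[i:]

def delete(word, i):
    """Remove the character at position i."""
    return word[:i] + word[i + 1:]

def substitute(word, i, ch):
    """Replace the character at position i with ch."""
    return word[:i] + ch + word[i + 1:]

def transpose(word, i):
    """Swap the characters at positions i and i + 1."""
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]

# Each call plays the part of the noisy channel distorting an actual word.
print(delete("giraffe", 1))        # graffe
print(substitute("the", 2, "w"))   # thw
print(transpose("the", 1))         # teh
print(insert("the", 3, "e"))       # thee
```

Recovering the actual word means running these operations in reverse on the typo, for example inserting /i/ into /graffe/ to get back /giraffe/.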

Spelling dictionaries

Implementing a spelling identification and correction algorithm
- STAGE 1: compare each string in the document with a list of legal strings; if there is no corresponding string in the list, mark it as misspelled
- STAGE 2: generate a list of candidates
  - Apply any single transformation to the typo string
  - Filter the list by checking against a dictionary
- STAGE 3: assign probability values to each candidate in the list
- STAGE 4: select the best candidate
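Stages 1 and 2 might be sketched in Python as below. The word list is a tiny invented dictionary, and the names `single_edits`, `find_misspelled` and `candidates` are hypothetical; a real checker would load a full list of legal strings.

```python
import string

# Tiny invented word list standing in for a real dictionary (an assumption).
DICTIONARY = {"the", "thaw", "threw", "two", "giraffe", "saw", "a"}

def single_edits(word):
    """All strings one insertion, deletion, substitution or transposition away."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    inserts = {a + c + b for a, b in splits for c in letters}
    deletes = {a + b[1:] for a, b in splits if b}
    substitutes = {a + c + b[1:] for a, b in splits if b for c in letters}
    transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    return inserts | deletes | substitutes | transposes

def find_misspelled(tokens):
    """Stage 1: flag every token that is not in the list of legal strings."""
    return [t for t in tokens if t not in DICTIONARY]

def candidates(typo):
    """Stage 2: apply single transformations, then keep only dictionary words."""
    return sorted(single_edits(typo) & DICTIONARY)

tokens = "thw giraffe saw a graffe".split()
for typo in find_misspelled(tokens):
    print(typo, "->", candidates(typo))
```

With this toy dictionary, /thw/ yields the candidates /thaw/ and /the/, while /graffe/ yields only /giraffe/. Words needing two transformations, such as /threw/, are not proposed.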

Spelling dictionaries

STAGE 3
- Prior probability
  - Given all the words in English, is this candidate more likely to be what the typist meant than that candidate?
  - P(c) = count(c) / N, where count(c) is the number of times candidate c occurs in a corpus and N is the number of words in that corpus
- Likelihood
  - Given the possible errors, or transformations, how likely is it that error y has operated on candidate x to produce the typo?
  - P(t|c), calculated using a corpus of errors, or transformations
- Bayesian rule
  - Get the product of the prior probability and the likelihood
  - P(c) × P(t|c)
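The ranking step can be sketched in Python. Every number below is invented purely for illustration; real values would come from a text corpus (for the prior) and a corpus of observed errors (for the likelihood). The names `prior` and `score` are hypothetical.

```python
# Invented counts for two candidates for the typo "thw" (assumptions).
WORD_COUNTS = {"the": 500, "thaw": 2}
CORPUS_SIZE = 10_000   # N: total number of words in the assumed corpus

# Invented likelihoods P(t|c) for producing the typo "thw" from each candidate.
ERROR_LIKELIHOOD = {
    "the": 0.001,     # w substituted for e
    "thaw": 0.0005,   # a deleted
}

def prior(candidate):
    """P(c) = count(c) / N."""
    return WORD_COUNTS[candidate] / CORPUS_SIZE

def score(candidate):
    """Product of prior and likelihood: P(c) * P(t|c)."""
    return prior(candidate) * ERROR_LIKELIHOOD[candidate]

# Select the candidate with the highest score.
best = max(WORD_COUNTS, key=score)
print(best)   # the
```

Under these assumed figures /the/ wins on both counts: it is far more frequent, and the substitution of /w/ for /e/ is taken to be a likelier slip.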

Spelling dictionaries

Non-word errors: implementing a spelling identification and correction algorithm
- STAGE 1: identify misspelled words
- STAGE 2: generate list of candidates
- STAGE 3a: rank candidates by probability
- STAGE 3b: select best candidate
Implement:
- noisy channel model
- Bayesian rule

Next week

Resources for globalisation
- Machine translation
- Translation memory