Globalisation & Computer systems
Week 7: Text processes and globalisation, part 1
- Sorting strings: collation
- Searching strings and regular expressions
Text processes
- Character encoding design: “must provide the set of code values that allows programmers to design applications capable of implementing a variety of text processes in the desired language”
- Text processes operate over text elements
Text processes: Text elements
- The objects of a text
- Depend on perspective
- Different text processes operate over different objects
Sorting (collation)
- “The process of ordering units of textual information. Collation is usually specific to a particular language” (Unicode version 3: glossary)
Sorting: Language specific
- sort order
  - phonetically based sort
  - graphically based sort
- sort element
Sorting: Levels of comparison
- Level 1 (primary difference)
- Levels 2 and 3 (similar)
- Level 4 (exact match)
Sorting: Levels of comparison
- Level 4: exact match
  - match in code value
  - character equivalence
  - resumes : resumes
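The difference between a match in code value and character equivalence can be sketched in Python. This is an illustration only: it assumes that "character equivalence" means Unicode canonical equivalence, tested with the standard-library `unicodedata` module, and the variable names are hypothetical.

```python
import unicodedata

# Two spellings of "résumés": one with precomposed é (U+00E9),
# one with a plain e followed by a combining acute accent (U+0301).
precomposed = "r\u00e9sum\u00e9s"
decomposed = "re\u0301sume\u0301s"

# Match in code value: the code point sequences differ.
same_code_values = precomposed == decomposed

# Character equivalence: normalise both strings to the same form first.
equivalent = (unicodedata.normalize("NFC", precomposed)
              == unicodedata.normalize("NFC", decomposed))

print(same_code_values, equivalent)  # False True
```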
Sorting: Levels of comparison
- Level 1 (primary difference: alphabetic)
  - resume < resumes
- Level 2 (similar: no accent < accent)
  - resume < résumé
  - resumes < résumés
- Level 3 (similar: lower case < upper case)
  - résumé < Résumé
Sorting: Forward and backward sequence sort
- Forward sequence
  - Start comparison from beginning of string
- Backward sequence
  - Start comparison from end of string
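A minimal sketch of the two directions, applied to accent differences only. The choice of French words is an assumption made for illustration (traditional French ordering is the usual example of backward accent comparison), and the helper names `base_letters` and `accent_weights` are hypothetical.

```python
import unicodedata

def base_letters(word):
    """Letters of the word with accents stripped (used as the primary comparison)."""
    decomposed = unicodedata.normalize("NFD", word)
    return [ch for ch in decomposed if not unicodedata.combining(ch)]

def accent_weights(word):
    """One entry per base letter: '' if unaccented, else its combining marks."""
    weights = []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch):
            if weights:
                weights[-1] += ch
        else:
            weights.append("")
    return weights

words = ["côté", "coté", "côte", "cote"]

# Forward sequence: compare accents from the beginning of the string.
forward = sorted(words, key=lambda w: (base_letters(w), accent_weights(w)))

# Backward sequence: compare accents from the end of the string.
backward = sorted(words, key=lambda w: (base_letters(w), accent_weights(w)[::-1]))

print(forward)   # ['cote', 'coté', 'côte', 'côté']
print(backward)  # ['cote', 'côte', 'coté', 'côté']
```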
Sorting: Implementation
- Sort keys
  - assign a set of weights to each character in the string
  - compare substrings according to weighting
  - switch weightings on / off
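The sort-key idea can be sketched as follows. This is not a full collation implementation: the function `sort_key`, its `levels` parameter, and the way weights are derived (base letter, accent, case, then the raw string) are assumptions chosen to mirror the four comparison levels.

```python
import unicodedata

def sort_key(text, levels=4):
    """Build a multi-level key; `levels` switches the lower weightings off."""
    primary, secondary, tertiary = [], [], []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            if secondary:
                secondary[-1] += ch          # accent belongs to the previous letter
        else:
            primary.append(ch.lower())       # level 1: alphabetic weight
            secondary.append("")             # level 2: no accent < accent
            tertiary.append(ch.isupper())    # level 3: lower case < upper case
    key = (primary, secondary, tertiary, text)  # level 4: the code values themselves
    return key[:levels]

words = ["Résumé", "résumés", "resumes", "résumé", "resume"]
ordered = sorted(words, key=sort_key)
print(ordered)  # ['resume', 'résumé', 'Résumé', 'resumes', 'résumés']
```

With `levels=1` the accent and case weightings are switched off, so `sort_key("Résumé", levels=1)` and `sort_key("resume", levels=1)` compare as equal.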
Searching: Text elements
- The objects of a text
- Depend on perspective
- Different text processes operate over different objects
Regular Expressions
- Basis of many web-based and word-processor-based searches
- Definition 1: an algebraic notation for describing a set of strings
- Definition 2: a set of rules that you can use to specify one or more items, such as words in a file, by using a single character string (Sarwar et al.)
Regular Expressions
- A search needs a regular expression and a text corpus
- Regular expression algebra has variants:
  - Perl
  - Unix tools: egrep, sed, awk
Regular Expressions
- Find occurrences of /Nokia/ in the text:
  egrep -n 'Nokia' nokia_corpus.txt
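For readers without egrep to hand, the behaviour of `egrep -n` can be approximated in Python. The three-line `corpus` is a made-up stand-in for nokia_corpus.txt, and `grep_n` is a hypothetical helper. Python's `re` module is not egrep, but the operators used on these slides behave the same way in both.

```python
import re

# Hypothetical sample standing in for nokia_corpus.txt.
corpus = [
    "Nokia shares fell on Tuesday.",
    "Analysts said the drop was expected.",
    "The company, nokia, issued a profit warning.",
]

def grep_n(pattern, lines):
    """Return (line number, line) pairs for lines containing a match, like egrep -n."""
    regex = re.compile(pattern)
    return [(n, line) for n, line in enumerate(lines, start=1)
            if regex.search(line)]

hits = grep_n("Nokia", corpus)
print(hits)  # [(1, 'Nokia shares fell on Tuesday.')]
```

The search is case sensitive, so the lower-case "nokia" on line 3 is not reported.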
Regular Expressions: set operator
  egrep -n '[Nn]okia' nokia_corpus.txt
Regular Expressions: optional operator
  egrep -n 'shares?' nokia_corpus.txt
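A small sketch of what the optional operator matches, using Python's `re` module in place of egrep (an assumption made so the example is self-contained; the sample sentence is invented).

```python
import re

text = "One share rose while other shares fell; shareholders sold."

# The ? makes the preceding character (s) optional.
matches = re.findall(r"shares?", text)
print(matches)  # ['share', 'shares', 'share']
```

The third match comes from inside "shareholders": the pattern is not tied to word boundaries.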
Regular Expressions
- Kleene operators:
  - /string*/ “zero or more occurrences of the previous character”
  - /string+/ “one or more occurrences of the previous character”
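The two Kleene operators can be compared on a few candidate strings. The sketch uses Python's `re.fullmatch`, which tests the whole string rather than searching within a line as egrep does. The candidate words are invented.

```python
import re

candidates = ["profi", "profit", "profits", "profitsss"]

# s* : zero or more occurrences of the previous character (s)
star = [w for w in candidates if re.fullmatch(r"profits*", w)]

# s+ : one or more occurrences of the previous character (s)
plus = [w for w in candidates if re.fullmatch(r"profits+", w)]

print(star)  # ['profit', 'profits', 'profitsss']
print(plus)  # ['profits', 'profitsss']
```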
Regular Expressions
- Wildcard operator:
  - /string./ “any single character after the previous character”
- Combine wildcard and Kleene:
  - /string.*/ “zero or more instances of any character after the previous character”
  - /string.+/ “one or more instances of any character after the previous character”
Regular Expressions
  egrep -n 'profit.*' nokia_corpus.txt
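To see how much text each wildcard pattern covers, the matched span can be printed. This sketch uses Python's `re` module rather than egrep, and the sample lines are invented.

```python
import re

line = "Nokia issued a profit warning today"

one_char = re.search(r"profit.", line).group()    # exactly one extra character
to_end = re.search(r"profit.*", line).group()     # everything up to the end of the line

# When nothing follows "profit", .* still matches but .+ does not.
short = "record profit"
star_match = re.search(r"profit.*", short)
plus_match = re.search(r"profit.+", short)

print(repr(one_char))      # 'profit '
print(repr(to_end))        # 'profit warning today'
print(star_match.group())  # profit
print(plus_match)          # None
```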
Regular Expressions: Anchors
- Beginning of line operator: ^
  egrep '^said' nokia_corpus.txt
- End of line operator: $ (placed after the string it anchors)
  egrep 'said$' nokia_corpus.txt
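A short sketch of both anchors, using Python's `re` module on three invented lines.

```python
import re

lines = [
    "said the chairman on Monday",
    "Profits will fall, the chairman said",
    "He said nothing further",
]

# ^ anchors the match to the beginning of the line.
starts_with = [line for line in lines if re.search(r"^said", line)]

# $ anchors the match to the end of the line.
ends_with = [line for line in lines if re.search(r"said$", line)]

print(starts_with)  # ['said the chairman on Monday']
print(ends_with)    # ['Profits will fall, the chairman said']
```

The third line contains "said" but at neither edge, so neither anchored pattern reports it.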
Regular Expressions: Disjunction
- Set operator: /[Ss]tring/ “a string which begins with either S or s”
- Range: /[A-Z]tring/ “a string beginning with a capital letter”
- Pipe |: /string1|string2/ “either string1 or string2”
Regular Expressions
- Disjunction
  egrep -n 'weak|warning|drop' nokia_corpus.txt
  egrep -n 'weak.*|warn.*|drop.*' nokia_corpus.txt
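The three forms of disjunction can be tried out as follows. The sketch uses Python's `re` module in place of egrep, and the sample text is invented.

```python
import re

headline = "Shares fell; shares rose"

set_hits = re.findall(r"[Ss]hares", headline)    # S or s
range_hits = re.findall(r"[A-Z]hares", headline)  # any capital letter

corpus = [
    "Nokia issued a profit warning.",
    "Analysts warned of weak demand.",
    "Shares rose sharply.",
]

# Pipe: a line is selected if any alternative occurs in it.
pipe_lines = [n for n, line in enumerate(corpus, start=1)
              if re.search(r"weak|warning|drop", line)]

# With .* each alternative also covers the rest of the line.
span = re.search(r"weak.*|warn.*|drop.*", corpus[1]).group()

print(set_hits)    # ['Shares', 'shares']
print(range_hits)  # ['Shares']
print(pipe_lines)  # [1, 2]
print(span)        # warned of weak demand.
```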
Regular Expressions
- Negation: /[^a-z]tring/ “any string that does not begin with a lowercase letter”
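A sketch of the negated set, using Python's `re.fullmatch` on invented candidates. Note that `[^a-z]` still has to match one character, so a bare "tring" is rejected.

```python
import re

candidates = ["String", "string", "5tring", " tring", "tring"]

# [^a-z] matches exactly one character that is not a lowercase letter.
hits = [w for w in candidates if re.fullmatch(r"[^a-z]tring", w)]

print(hits)  # ['String', '5tring', ' tring']
```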
Regular Expressions: Precedence
1. Parentheses
2. Kleene and optional operators: * + ?
3. Anchors and sequences
4. Disjunction operator: |
(a) /supply|iers/ matches /supply/ or /iers/
(b) /suppl(y|iers)/ matches /supply/ or /suppliers/
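The effect of precedence on examples (a) and (b) can be checked directly. The sketch uses Python's `re.fullmatch`, and the list of test words is invented.

```python
import re

words = ["supply", "suppliers", "iers", "suppl"]

# (a) Sequence binds tighter than |, so this reads as (supply)|(iers).
without_parens = [w for w in words if re.fullmatch(r"supply|iers", w)]

# (b) Parentheses limit the disjunction to the ending.
with_parens = [w for w in words if re.fullmatch(r"suppl(y|iers)", w)]

print(without_parens)  # ['supply', 'iers']
print(with_parens)     # ['supply', 'suppliers']
```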