Natural Language Processing
Lecture 4: Basic Text Processing
Regular expressions
• A formal language for specifying text strings
• How can we search for any of these?
  • woodchucks
  • Woodchucks
Regular Expressions: Disjunctions
• Letters inside square brackets []
  Pattern        Matches
  [wW]oodchuck   Woodchuck, woodchuck
  [1234567890]   Any digit
• Ranges
  Pattern   Matches
  [A-Z]     An upper case letter   Drenched Blossoms
  [a-z]     A lower case letter    my beans were impatient
  [0-9]     A single digit         Chapter 1: Down the Rabbit Hole
Regular Expressions: Negation in Disjunction
• Negations: [^Ss]
• Caret means negation only when it is first in []
  Pattern   Matches
  [^A-Z]    Not an upper case letter   Oyfn pripetchik
  [^Ss]     Neither 'S' nor 's'        "I have no exquisite reason"
  [^e^]     Neither e nor ^            Look here
  a^b       The pattern a caret b      Look up a^b now
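A minimal Python sketch of these character classes (patterns taken from the tables above; Python's re module is just one of many regex engines):

  import re
  # [wW] matches either 'w' or 'W'
  print(re.findall(r"[wW]oodchuck", "Woodchucks love woodchucks"))  # ['Woodchuck', 'woodchuck']
  # [^Ss] matches any character that is neither 'S' nor 's'
  print(re.findall(r"[^Ss]", "Ss!"))  # ['!']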
Regular Expressions: More Disjunction
• Woodchucks is another name for groundhog!
• The pipe | for disjunction
  Pattern                     Matches
  groundhog|woodchuck         groundhog, woodchuck
  yours|mine                  yours, mine
  a|b|c = [abc]
  [gG]roundhog|[Ww]oodchuck
Regular Expressions: ? * + .
  Pattern   Matches
  colou?r   Optional previous char: color, colour
  aa*b!     0 or more of previous char: ab!, aab!, aaab!, aaaab!
  a+b!      1 or more of previous char: ab!, aab!, aaab!, aaaab!
  baa+      baa, baaa, baaaa, baaaaa
  beg.n     begin, begun, beg3n
• Kleene * and Kleene + are named after Stephen C. Kleene
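A quick Python check of these operators (same patterns as the table; outputs shown in comments):

  import re
  print(re.findall(r"colou?r", "color colour"))      # ['color', 'colour'] — ? makes the u optional
  print(bool(re.fullmatch(r"aa*b!", "ab!")))         # True  — * allows zero extra a's
  print(bool(re.fullmatch(r"a+b!", "b!")))           # False — + needs at least one a
  print(re.findall(r"beg.n", "begin begun beg3n"))   # ['begin', 'begun', 'beg3n'] — . is any char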
Regular Expressions: Anchors ^ $
  Pattern      Matches
  ^[A-Z]       Palo Alto
  ^[^A-Za-z]   1 "Hello"
  .$           The end? The end!
Example
• Find me all instances of the word "the" in a text.
  the                        Misses capitalized examples (The)
  [tT]he                     Incorrectly returns other or theology
  [^a-zA-Z][tT]he[^a-zA-Z]   Requires a non-letter on each side
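The same progression can be tested in Python; \b (word boundary) is an equivalent way to write the final non-letter-context trick:

  import re
  text = "The other one is the theology book. The end."
  print(re.findall(r"[tT]he", text))      # ['The', 'the', 'the', 'the', 'The'] — overmatches
  print(re.findall(r"\b[tT]he\b", text))  # ['The', 'the', 'The'] — only whole words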
Errors
• The process we just went through was based on fixing two kinds of errors:
  • Matching strings that we should not have matched (there, then, other): false positives (Type I)
  • Not matching things that we should have matched (The): false negatives (Type II)
• In NLP we are always dealing with these kinds of errors.
• Reducing the error rate for an application often involves two antagonistic efforts:
  • Increasing accuracy or precision (minimizing false positives)
  • Increasing coverage or recall (minimizing false negatives)
Precision and Recall
• Precision: how many selected items are relevant
• Recall: how many relevant items are selected
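A worked example (the counts here are invented for illustration):

  # Suppose a "the"-matcher selected 100 strings: 90 real occurrences plus 10
  # false hits like "other"/"theology", while missing 5 capitalized "The"s.
  tp, fp, fn = 90, 10, 5
  precision = tp / (tp + fp)   # fraction of selected items that are relevant
  recall    = tp / (tp + fn)   # fraction of relevant items that were selected
  print(round(precision, 3), round(recall, 3))   # 0.9 0.947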
Accuracy and Precision
• Accuracy refers to the closeness of a measured value to a standard or known value. For example, if in the lab you obtain a weight measurement of 3.2 kg for a given substance, but the actual or known weight is 10 kg, then your measurement is not accurate: it is not close to the known value.
• Precision refers to the closeness of two or more measurements to each other. Using the example above, if you weigh a given substance five times and get 3.2 kg each time, then your measurement is very precise. Precision is independent of accuracy: you can be very precise but inaccurate, as described above, or accurate but imprecise.
• https://labwrite.ncsu.edu/Experimental%20Design/accuracyprecision.htm
Accuracy and Precision
• Accuracy refers to how close a measurement is to the true value.
• Precision is how consistent results are when measurements are repeated.
• An easy way to remember the difference: ACcurate is Correct (or Close to real value); PRecise is Repeating (or Repeatable).
Word tokenization
Text Normalization
• Every NLP task needs to do text normalization:
  • Segmenting/tokenizing words in running text
  • Normalizing word formats
  • Segmenting sentences in running text
Tokenization
• Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation. Here is an example of tokenization:
• Input: Friends, Romans, Countrymen, lend me your ears;
• Output: Friends | Romans | Countrymen | lend | me | your | ears
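A naive tokenizer for the example above, assuming we simply keep runs of word characters and drop punctuation:

  import re
  text = "Friends, Romans, Countrymen, lend me your ears;"
  print(re.findall(r"\w+", text))
  # ['Friends', 'Romans', 'Countrymen', 'lend', 'me', 'your', 'ears']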
How many words?
• I do uh main- mainly business data processing
  • Fragments, filled pauses
• Seuss's cat in the hat is different from other cats!
  • Lemma: same stem, part of speech, rough word sense
    • cat and cats = same lemma
  • Wordform: the full inflected surface form
    • cat and cats = different wordforms
How many words?
• they lay back on the San Francisco grass and looked at the stars and their
• Type: an element of the vocabulary.
• Token: an instance of that type in running text.
• How many?
  • 15 tokens (or 14)
  • 13 types (or 12) (or 11?)
How many words?
• N = number of tokens
• V = vocabulary = set of types
• |V| is the size of the vocabulary
• Church and Gale (1990): |V| > O(N½)
                                    Tokens = N    Types = |V|
  Switchboard phone conversations   2.4 million   20 thousand
  Shakespeare                       884,000       31 thousand
  Google N-grams                    1 trillion    13 million
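A quick check of the token/type counts from the previous slide (simple whitespace splitting; different choices about case and about "San Francisco" as one name give the alternative counts):

  tokens = ("they lay back on the San Francisco grass "
            "and looked at the stars and their").split()
  print(len(tokens))       # N = 15 tokens
  print(len(set(tokens)))  # |V| = 13 types ("the" and "and" each appear twice)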
Linux TR Command for basic text processing
• tr is a UNIX utility for translating, deleting, or squeezing repeated characters. It reads from STDIN and writes to STDOUT. tr stands for translate.
• Syntax:
  $ tr [OPTION] SET1 [SET2]
• If both SET1 and SET2 are specified, tr replaces each character in SET1 with the character in the same position in SET2.
Linux TR Command for basic text processing
• Complement of a pattern: -c replaces every character NOT in the set (including the newline added by echo):
  echo abcdefghijklmnop | tr -c 'a' 'z'
  azzzzzzzzzzzzzzzz
• Delete specific characters:
  echo 'clean this up' | tr -d 'up'
  clean this
• The -c option combined with the -d delete option extracts a phone number:
  echo 'Phone: 01234 567890' | tr -cd '[:digit:]'
  01234567890
• Squeeze repeating characters with -s. In the following example a string with too many spaces is squeezed to remove them:
  echo 'too    many    spaces here' | tr -s '[:space:]'
  too many spaces here
Linux TR Command for basic text processing
• To truncate the first set to the length of the second set, use the -t option. By default tr repeats the last character of the second set when the two sets have different lengths. Consider the following example:
  echo 'the cat in the hat' | tr 'abcdefgh' '123'
  t33 31t in t33 31t
  The last character of the second set ('3') is repeated to cover every letter from c to h. Using the truncate option limits the matching to the length of the second set:
  echo 'the cat in the hat' | tr -t 'abcdefgh' '123'
  the 31t in the h1t
• Find and replace:
  echo "some_url_that_I_have" | tr "_" "-"
  some-url-that-I-have
Linux TR Command for basic text processing
• Translate between upper and lower case characters with the "[:lower:]" and "[:upper:]" classes:
  $ echo 'welcome to Computer Science' | tr "[:lower:]" "[:upper:]"
  WELCOME TO COMPUTER SCIENCE
  $ echo 'welcome to Computer Science' | tr "a-z" "A-Z"
  WELCOME TO COMPUTER SCIENCE
• Delete newlines:
  cat shakes.txt
  Line one
  Line two
  ...
  cat shakes.txt | tr -d '\n'
• Whitespace to tabs:
  tr '[:space:]' '\t' < shakes.txt | more
• Whitespace to newlines:
  tr '[:space:]' '\n' < shakes.txt | more
Linux TR Command for basic text processing
• Upper case to lower case, reading from items.txt and writing the result to output.txt:
  $ tr '[:upper:]' '[:lower:]' < items.txt > output.txt
  $ cat output.txt
Simple Tokenization in UNIX
• Given a text file, output the word tokens and their frequencies:
  tr -sc 'A-Za-z' '\n' < shakes.txt | sort | uniq -c
    tr -sc 'A-Za-z' '\n'   Change all non-alpha characters to newlines
    sort                   Sort in alphabetical order
    uniq -c                Merge and count each type
• Please study: https://www.thegeekstuff.com/2012/12/linux-tr-command/
More counting
• Merging upper and lower case:
  tr 'A-Z' 'a-z' < shakes.txt | tr -sc 'A-Za-z' '\n' | sort | uniq -c
• Sorting the counts:
  tr 'A-Z' 'a-z' < shakes.txt | tr -sc 'A-Za-z' '\n' | sort | uniq -c | sort -nr
  23243 the
  22225 i
  18618 and
  16339 to
  15687 of
  12780 a
  12163 you
  10839 my
  10005 in
   8954 d
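A rough Python equivalent of this pipeline (assuming a local shakes.txt):

  import re
  from collections import Counter

  with open("shakes.txt") as f:
      text = f.read().lower()              # like tr A-Z a-z
  tokens = re.findall(r"[a-z]+", text)     # runs of letters, like tr -sc A-Za-z '\n'
  for word, count in Counter(tokens).most_common(10):   # sort | uniq -c | sort -nr
      print(count, word)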
Issues in Tokenization
• Finland's capital     → Finlands? Finland's?
• what're, I'm, isn't   → What are, I am, is not
• Hewlett-Packard       → Hewlett Packard?
• state-of-the-art      → state of the art?
• Lowercase             → lower-case? lower case?
• San Francisco         → one token or two?
• m.p.h., Ph.D.         → ??
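A naive \w+ tokenizer (splitting on non-word characters) makes one fixed choice for each of these cases, which shows why they are hard:

  import re
  for s in ["Finland's capital", "what're", "Hewlett-Packard", "m.p.h."]:
      print(s, "->", re.findall(r"\w+", s))
  # Finland's capital -> ['Finland', 's', 'capital']
  # what're -> ['what', 're']
  # Hewlett-Packard -> ['Hewlett', 'Packard']
  # m.p.h. -> ['m', 'p', 'h']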
Tokenization: language issues
• French
  • L'ensemble: one token or two?
    • L? L'? Le?
    • Want l'ensemble to match with un ensemble
• German noun compounds are not segmented:
  • Lebensversicherungsgesellschaftsangestellter
  • 'life insurance company employee'
  • German information retrieval needs a compound splitter
Tokenization: language issues
• Chinese and Japanese have no spaces between words:
  • 莎拉波娃现在居住在美国东南部的佛罗里达。
  • 莎拉波娃 现在 居住 在 美国 东南部 的 佛罗里达
  • Sharapova now lives in US southeastern Florida
• Further complicated in Japanese, with multiple alphabets intermingled:
  • フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)
  • The example mixes Katakana, Hiragana, Kanji, and Romaji
  • End-user can express a query entirely in hiragana!
Word Tokenization in Chinese
• Also called Word Segmentation
• Chinese words are composed of characters
  • Characters are generally 1 syllable and 1 morpheme
  • Average word is 2.4 characters long
• Standard baseline segmentation algorithm: Maximum Matching (also called Greedy)
Maximum Matching Word Segmentation Algorithm
• Given a wordlist of Chinese, and a string:
  1. Start a pointer at the beginning of the string
  2. Find the longest word in the dictionary that matches the string starting at the pointer
  3. Move the pointer over that word in the string
  4. Go to step 2
(A Python sketch follows the illustration on the next slide.)
Max-match segmentation illustration
• Thecatinthehat → the cat in the hat
• Thetabledownthere → the table down there (intended)
                    → theta bled own there (what greedy max-match finds)
• Doesn't generally work in English!
• But works astonishingly well in Chinese:
  • 莎拉波娃现在居住在美国东南部的佛罗里达。
  • 莎拉波娃 现在 居住 在 美国 东南部 的 佛罗里达
• Modern probabilistic segmentation algorithms are even better
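A sketch of max-match in Python; the toy English wordlist is invented purely to reproduce the examples above:

  def max_match(text, wordlist):
      """Greedy maximum matching: at each position, take the longest
      dictionary word that starts there; fall back to one character."""
      words, i = [], 0
      while i < len(text):
          for j in range(len(text), i, -1):   # try the longest candidate first
              if text[i:j] in wordlist:
                  words.append(text[i:j])
                  i = j
                  break
          else:                               # no dictionary word matched here
              words.append(text[i])
              i += 1
      return words

  wordlist = {"the", "cat", "in", "hat", "theta", "table",
              "down", "there", "bled", "own"}
  print(max_match("thecatinthehat", wordlist))
  # ['the', 'cat', 'in', 'the', 'hat'] — works here
  print(max_match("thetabledownthere", wordlist))
  # ['theta', 'bled', 'own', 'there'] — greedy picks the wrong segmentation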