LING 388 Computers and Language Lecture 18 Administrivia


LING 388: Computers and Language Lecture 18


Administrivia • Homework 8 Review • More Python Regex practice? • Homework 9: install nltk and nltk_data on your laptops

Homework 8 review
• Italy coronavirus deaths rise by record 475 in a day
• https://www.bbc.com/news/world-europe-51952712
• hw8.txt
Using regexes in Python:
1. Find the numbers in the article. List them. How many of them are there?
2. Find all the named entities (approximately everything beginning with an uppercase letter denoting people, places, organizations, events, etc.), e.g. World Health Organization or WHO. List them. How many of them are there?
3. Propose how you could filter out the words at the beginning of each sentence that aren't really named entities. Show your code if you have it. How many named entities are there now?

Homework 8 Review: Part 1
import re
fh = open('hw8.txt')
text = fh.read()
re.findall(r"[0-9,.]+\b", text)
['475', '18', '475', '3,000', '35,713', '4,000', '319', '8,758', '200,000', '80', '598', '13,716', '17', '19', '16', '7,730', '175', '7', '65', '104', '12', '8,198', '14', '1,486', '27', '30', '500']
len(re.findall(r"[0-9,.]+\b", text))
27
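The number pattern from Part 1 can be tried on a short string without the article file. This is a minimal sketch: the sample sentence is made up for illustration and is not taken from hw8.txt.

```python
import re

# Hypothetical sample text standing in for hw8.txt (not the real article).
sample = ("Deaths rose by 475 in a day, to nearly 3,000. "
          "About 35,713 cases were confirmed.")

# One or more digits, commas, or periods, ending at a word boundary (\b),
# so a sentence-final comma or period is left out of the match.
NUMBER = re.compile(r"[0-9,.]+\b")

numbers = NUMBER.findall(sample)
print(numbers)       # ['475', '3,000', '35,713']
print(len(numbers))  # 3
```

Note that the character class also accepts a lone comma or period when it is immediately followed by a letter or digit (as in "e.g"), so a stricter pattern that requires at least one digit may be worth considering for messier text.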


Homework 8 Review: Part 2
[tuple[0] for tuple in re.findall(r"([A-Z]\w+(\s+[A-Z]\w+)*)", text)]
['Italy', 'Quartieri Spagnoli', 'Naples', 'Italy', 'March', 'EPAn. The', 'Italy', 'There', 'Lombardy', 'Italy', 'China', 'At', 'China', 'The', 'Europe', 'Western Pacific', 'Asia', 'World Health Organization', 'WHO', 'US', 'Canada', 'Eurovision Song Contest', 'Many', 'But', 'WHO', 'Tedros Ghebreyesus', 'Wednesday', 'Italy', 'People', 'The WHO', 'Mike Ryan', 'Dr Tedros', 'Reportn. The', 'US', 'Kaiser Permanente', 'Seattle', 'But', 'How', 'Europe', 'Spain', 'An', 'Madrid', 'Covid', 'In France', 'Tuesday', 'In', 'UK', 'Germany', 'In', 'TV', 'Chancellor Angela Merkel', 'Germans', 'Since German', 'World War Two', 'Belgium', 'Coronavirus', 'Venice', 'What', 'Europe', 'Travellers', 'EU', 'The', 'Europeans', 'UK', 'Brexit', 'UK', 'EU', 'The', 'EU', 'Schengen', 'Iceland', 'Switzerland', 'Norway', 'Liechtenstein', 'All', 'Germany', 'Morocco', 'Egypt', 'Philippines', 'Argentina', 'In']
len([tuple[0] for tuple in re.findall(r"([A-Z]\w+(\s+[A-Z]\w+)*)", text)])
85
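As a variation on the Part 2 pattern, a non-capturing group `(?:...)` lets `re.findall` return plain strings, so the `tuple[0]` step is not needed. The sketch below assumes a made-up sample sentence rather than the article text.

```python
import re

# Hypothetical sample text (not from hw8.txt).
sample = ("The World Health Organization said Italy and Spain were hit hard. "
          "Dr Tedros spoke on Wednesday.")

# A capitalized word followed by zero or more further capitalized words.
# (?:...) groups without capturing, so findall returns whole matches.
ENTITY = re.compile(r"[A-Z]\w+(?:\s+[A-Z]\w+)*")

entities = ENTITY.findall(sample)
print(entities)
# ['The World Health Organization', 'Italy', 'Spain', 'Dr Tedros', 'Wednesday']
```

The first match shows the same weakness as the homework output: a sentence-initial word such as "The" is swept into the candidate entity.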



Homework 8 Review: Part 3 Problem • (about 24 words don't belong): • ['Italy', 'Quartieri Spagnoli', 'Naples', 'Italy', 'March', 'EPAn. The', 'Italy', 'There', 'Lombardy', 'Italy', 'China', 'At', 'China', 'The', 'Europe', 'Western Pacific', 'Asia', 'World Health Organization', 'WHO', 'US', 'Canada', 'Eurovision Song Contest', 'Many', 'But', 'WHO', 'Tedros Ghebreyesus', 'Wednesday', 'Italy', 'People', 'The WHO', 'Mike Ryan', 'Dr Tedros', 'Reportn. The', 'US', 'Kaiser Permanente', 'Seattle', 'But', 'How', 'Europe', 'Spain', 'An', 'Madrid', 'Covid', 'In France', 'Tuesday', 'In', 'UK', 'Germany', 'In', 'TV', 'Chancellor Angela Merkel', 'Germans', 'Since German', 'World War Two', 'Belgium', 'Coronavirus', 'Venice', 'What', 'Europe', 'Travellers', 'EU', 'The', 'Europeans', 'UK', 'Brexit', 'UK', 'EU', 'The', 'EU', 'Schengen', 'Iceland', 'Switzerland', 'Norway', 'Liechtenstein', 'All', 'Germany', 'Morocco', 'Egypt', 'Philippines', 'Argentina', 'In']


Homework 8 Review: Part 3 • One common method: • Use a list of stopwords (functional vs. content words). • Filter them out from the answers • 10 most common words in the (1 million word) Brown corpus: {the, of, and, to, a, in, that, is, was, he} ['Italy', 'Quartieri Spagnoli', 'Naples', 'Italy', 'March', 'EPAn. The', 'Italy', 'There', 'Lombardy', 'Italy', 'China', 'At', 'China', 'The', 'Europe', 'Western Pacific', 'Asia', 'World Health Organization', 'WHO', 'US', 'Canada', 'Eurovision Song Contest', 'Many', 'But', 'WHO', 'Tedros Ghebreyesus', 'Wednesday', 'Italy', 'People', 'The WHO', 'Mike Ryan', 'Dr Tedros', 'Reportn. The', 'US', 'Kaiser Permanente', 'Seattle', 'But', 'How', 'Europe', 'Spain', 'An', 'Madrid', 'Covid', 'In France', 'Tuesday', 'In', 'UK', 'Germany', 'In', 'TV', 'Chancellor Angela Merkel', 'Germans', 'Since German', 'World War Two', 'Belgium', 'Coronavirus', 'Venice', 'What', 'Europe', 'Travellers', 'EU', 'The', 'Europeans', 'UK', 'Brexit', 'UK', 'EU', 'The', 'EU', 'Schengen', 'Iceland', 'Switzerland', 'Norway', 'Liechtenstein', 'All', 'Germany', 'Morocco', 'Egypt', 'Philippines', 'Argentina', 'In'] (Down to 10 if we use the 10 word stoplist)
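One way the stoplist idea might be coded is sketched below, using the 10 Brown-corpus words listed on the slide. The candidate list is a small hand-picked subset, and stripping a leading stopword from a multi-word candidate (e.g. 'The WHO') is an extra step assumed here for illustration, not something the slide specifies.

```python
# The 10 most common Brown corpus words, as listed on the slide.
STOPWORDS = {"the", "of", "and", "to", "a", "in", "that", "is", "was", "he"}

# A few candidates picked from the Part 2 output for illustration.
candidates = ["Italy", "The", "In", "World Health Organization",
              "The WHO", "In France", "But", "China"]

def drop_stopwords(items, stopwords):
    """Remove candidates that consist of a single stopword (case-insensitive)."""
    return [item for item in items if item.lower() not in stopwords]

def strip_leading_stopword(item, stopwords):
    """Assumed extra step: drop a stopword at the start of a multi-word candidate."""
    words = item.split()
    if len(words) > 1 and words[0].lower() in stopwords:
        return " ".join(words[1:])
    return item

filtered = [strip_leading_stopword(c, STOPWORDS)
            for c in drop_stopwords(candidates, STOPWORDS)]
print(filtered)
# ['Italy', 'World Health Organization', 'WHO', 'France', 'But', 'China']
```

'But' survives because it is not among the 10 stopwords, which suggests why a longer stoplist is usually needed in practice.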

Homework 8 Review: Part 3
• How to determine word frequency (assuming you've installed nltk and nltk_data):
>>> import nltk
>>> from nltk.corpus import brown
>>> words = brown.words()
>>> len(words)
1161192
>>> fd = nltk.FreqDist(words)
>>> fd.most_common(20)
[('the', 62713), (',', 58334), ('.', 49346), ('of', 36080), ('and', 27915), ('to', 25732), ('a', 21881), ('in', 19536), ('that', 10237), ('is', 10011), ('was', 9777), ('for', 8841), ('``', 8837), ("''", 8789), ('The', 7258), ('with', 7012), ('it', 6723), ('as', 6706), ('he', 6566), ('his', 6466)]
>>> words2 = [w.lower() for w in words]
>>> fd2 = nltk.FreqDist(words2)
>>> fd2.most_common(20)
[('the', 69971), (',', 58334), ('.', 49346), ('of', 36412), ('and', 28853), ('to', 26158), ('a', 23195), ('in', 21337), ('that', 10594), ('is', 10109), ('was', 9815), ('he', 9548), ('for', 9489), ('``', 8837), ("''", 8789), ('it', 8760), ('with', 7289), ('as', 7253), ('his', 6996), ('on', 6741)]
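If nltk is not installed yet, the effect of lowercasing on the counts can still be seen with the standard library. The sketch below uses `collections.Counter` as a stand-in for `nltk.FreqDist` (which, in NLTK 3, is built on `Counter` and offers the same `most_common` method); the token list is a toy example, not the Brown corpus.

```python
from collections import Counter

# Toy token list standing in for brown.words() (hypothetical data).
tokens = "The cat saw the dog and the dog saw the cat .".split()

# Case-sensitive counts: 'The' and 'the' are different keys.
fd = Counter(tokens)
# Case-folded counts: both spellings are merged into 'the'.
fd_lower = Counter(w.lower() for w in tokens)

print(fd.most_common(1))        # [('the', 3)]
print(fd_lower.most_common(1))  # [('the', 4)]
```

This mirrors the two runs on the slide, where 'the' rises from 62713 to 69971 once 'The' is folded in.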

More Python Regex practice? Zoom session
• See wordlist.py on the course website.
ling388-20$ python3 -i wordlist.py
>>> len(wordlist)
56057
• Exercise 1:
  • use re.search in a loop or list comprehension
• Exercise 2:
  • use re.match in a loop or list comprehension
• Exercise 3:
  • use re.match and word slice condition in a loop or list comprehension
• Find words ending in zac. How many are there?
  • Recall: the meta-character for the end-of-line anchor is $
• Find words beginning with anti. How many are there?
  • Hint: some cases may begin with a capital letter
• Look for words with prefix "pre"
  • Are all of them correct? (cf. pretend)
  • Devise a search that looks for words beginning with 'pre' but also contains the rest of the word as a word in the Brown corpus. Does that solve the problem?
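The three exercise styles might look like the sketch below. The `wordlist` and `vocabulary` here are tiny made-up stand-ins (the real wordlist.py has 56057 entries, and the exercise uses the Brown corpus as the vocabulary), so the counts are illustrative only.

```python
import re

# Hypothetical stand-in for the course wordlist.
wordlist = ["antibody", "Antarctic", "Anticipate", "prozac", "Balzac",
            "zacaton", "preheat", "pretend", "preview", "premise"]

# Hypothetical stand-in for the set of Brown corpus words.
vocabulary = {"heat", "tend", "view"}

# re.search scans the whole string; $ anchors the match at the end.
ends_zac = [w for w in wordlist if re.search(r"zac$", w)]

# re.match only matches at the start; IGNORECASE covers capitalized entries.
starts_anti = [w for w in wordlist if re.match(r"anti", w, re.IGNORECASE)]

# re.match plus a slice condition: the remainder after 'pre' must be a known word.
pre_plus_word = [w for w in wordlist
                 if re.match(r"pre", w) and w[3:] in vocabulary]

print(ends_zac)       # ['prozac', 'Balzac']
print(starts_anti)    # ['antibody', 'Anticipate']
print(pre_plus_word)  # ['preheat', 'pretend', 'preview']
```

With this toy vocabulary, 'pretend' still passes the slice test because 'tend' is itself a word, which is the kind of case the last question on the slide asks about.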

www.nltk.org

Platforms
• Today I'll go through installation for:
  1. macOS (and Linux)
  2. Windows 10
• Your homework assignment:
  • install nltk, and
  • check to see if it works (by next time)
• In the next few lectures:
  • http://www.nltk.org/book/
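A quick way to check whether the packages are visible to the Python you are running is sketched below. It uses only the standard library; the helper name `is_installed` is made up for this example.

```python
import importlib.util

def is_installed(package_name):
    """Return True if a top-level package can be found by this Python."""
    return importlib.util.find_spec(package_name) is not None

# Packages mentioned in this lecture.
for name in ("nltk", "numpy", "matplotlib"):
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If a package shows as missing right after a successful pip install, a likely cause is that pip and python3 belong to different Python installations.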

NLTK 3.4.5 Install
• NLTK requires Python versions 2.7, 3.5, 3.6, or 3.7
• See http://www.nltk.org/install.html
• Use pip3 (for Python 3) to install packages from the Python Package Index (PyPI)
  (sudo means: execute the pip3 command with admin privilege)
sudo pip3 install -U nltk

NLTK Data Install
• See http://www.nltk.org/data.html
• python3
• If you get an SSL certificate error message, run:
/Applications/Python 3.6/Install Certificates.command

NLTK Data Install (WARNING!)
• Don't use the built-in Python 3 from macOS Catalina
• No more GUI!
• Also throws an SSL: CERTIFICATE_VERIFY_FAILED error

NLTK Data Install: SSL: CERTIFICATE_VERIFY_FAILED
• Workaround:
~$ /usr/bin/sudo /bin/mkdir /Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.7/etc
~$ /usr/bin/python3 -c 'import ssl; print(ssl.get_default_verify_paths())'
DefaultVerifyPaths(cafile=None, capath=None, openssl_cafile_env='SSL_CERT_FILE', openssl_cafile='/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.7/etc/ssl/cert.pem', openssl_capath_env='SSL_CERT_DIR', openssl_capath='/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.7/etc/ssl/certs')
~$ /usr/bin/sudo /bin/ln -s /etc/ssl/ /Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.7/etc/
~$ /usr/bin/python3 -c 'import ssl; print(ssl.get_default_verify_paths())'
DefaultVerifyPaths(cafile='/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.7/etc/ssl/cert.pem', capath='/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.7/etc/ssl/certs', openssl_cafile_env='SSL_CERT_FILE', openssl_cafile='/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.7/etc/ssl/cert.pem', openssl_capath_env='SSL_CERT_DIR', openssl_capath='/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.7/etc/ssl/certs')
~$ which python3
/usr/bin/python3
~$ which pip3
/usr/bin/pip3
• Details: https://stackoverflow.com/questions/57630314/ssl-certificate-verify-failed-error-with-python3-on-macos-10-15
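The same check as the one-liner in the workaround can be written as a short script that also reports whether each certificate location actually exists. This is only a diagnostic sketch (the helper `describe` is a made-up name); it does not change any system settings.

```python
import os
import ssl

def describe(path):
    """Report whether a certificate path is unset, present, or missing on disk."""
    if path is None:
        return "not set"
    return "exists" if os.path.exists(path) else "missing"

paths = ssl.get_default_verify_paths()

# cafile/capath are None when the compiled-in locations do not exist,
# which is the state shown before the workaround is applied.
print("cafile:", paths.cafile)
print("capath:", paths.capath)
print("openssl_cafile:", paths.openssl_cafile, "->", describe(paths.openssl_cafile))
print("openssl_capath:", paths.openssl_capath, "->", describe(paths.openssl_capath))
```

A `cafile=None, capath=None` result, as in the first output on the slide, indicates that Python has no default certificate store to verify HTTPS downloads against.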

Also Tcl/Tk is broken on macOS Catalina
https://www.python.org/download/mac/tcltk/
So nltk graphics won't work…

For macOS, install from python.org
• Python 3.8.1

For macOS, install from python.org
~$ which python3
/Library/Frameworks/Python.framework/Versions/3.8/bin/python3
~$ which pip3
/Library/Frameworks/Python.framework/Versions/3.8/bin/pip3
~$ echo $PATH
/Library/Frameworks/Python.framework/Versions/3.8/bin:/usr/local/bin:/usr/bin:/usr/sbin:/Library/TeX/texbin:/opt/X11/bin

Install nltk (on macOS)
~$ python3
Python 3.8.1 (v3.8.1:1b293b6006, Dec 18 2019, 14:08:53)
[Clang 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import nltk
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'nltk'
>>> ^D
~$ pip3 install --user -U nltk
Collecting nltk
  Using cached https://files.pythonhosted.org/packages/f6/1d/d925cfb4f324ede997f6d47bea4d9babba51b49e87a767c170b77005889d/nltk-3.4.5.zip
Collecting six (from nltk)
  Downloading https://files.pythonhosted.org/packages/65/26/32b8464df2a97e6dd1b656ed26b2c194606c16fe163c695a992b36c11cdf/six-1.13.0-py2.py3-none-any.whl
Installing collected packages: six, nltk
  Running setup.py install for nltk ... done
Successfully installed nltk-3.4.5 six-1.13.0

Install numpy
~$ pip3 install --user -U numpy
Collecting numpy
  Downloading https://files.pythonhosted.org/packages/a7/06/6d616fb5fb423db595b1502cbd873f3f2025f2fd8509046c771a20c4302a/numpy-1.18.1-cp38macosx_10_9_x86_64.whl (15.2 MB)
     |████████████████| 15.2 MB 2.0 MB/s
Installing collected packages: numpy
  WARNING: The scripts f2py, f2py3 and f2py3.8 are installed in '/Users/sandiway/Library/Python/3.8/bin' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
Successfully installed numpy-1.18.1

Install matplotlib
~$ python3 -m pip install -U matplotlib
Collecting matplotlib
  Downloading https://files.pythonhosted.org/packages/52/dd/ffb5cad3cf2f41bc3966489709e4e020a34f8d183fe85c91dc8a3db8bcf5/matplotlib-3.1.2-cp38-macosx_10_9_x86_64.whl (13.2 MB)
     |████████████████| 13.2 MB 1.7 MB/s
Collecting python-dateutil>=2.1 (from matplotlib)
  Downloading https://files.pythonhosted.org/packages/d4/70/d60450c3dd48ef87586924207ae8907090de0b306af2bce5d134d78615cb/python_dateutil-2.8.1-py2.py3-none-any.whl (227 kB)
     |████████████████| 235 kB 1.4 MB/s
Collecting pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 (from matplotlib)
  Downloading https://files.pythonhosted.org/packages/5d/bc/1e58593167fade7b544bfe9502a26dc860940a79ab306e651e7f13be68c2/pyparsing-2.4.6-py2.py3-none-any.whl (67 kB)
     |████████████████| 71 kB 1.3 MB/s
Collecting kiwisolver>=1.0.1 (from matplotlib)
  Downloading https://files.pythonhosted.org/packages/22/a7/8f7706e8c1e847b9816bbb2c4c341e2ae9568653f17956f839574ed62815/kiwisolver-1.1.0-cp38-macosx_10_9_x86_64.whl (61 kB)
     |████████████████| 71 kB 1.4 MB/s
Requirement already satisfied, skipping upgrade: numpy>=1.11 in ./Library/Python/3.8/lib/python/site-packages (from matplotlib) (1.18.1)
Collecting cycler>=0.10 (from matplotlib)
  Using cached https://files.pythonhosted.org/packages/f7/d2/e07d3ebb2bd7af696440ce7e754c59dd546ffe1bbe732c8ab68b9c834e61/cycler-0.10.0-py2.py3-none-any.whl
Requirement already satisfied, skipping upgrade: six>=1.5 in ./Library/Python/3.8/lib/python/site-packages (from python-dateutil>=2.1->matplotlib) (1.13.0)
Requirement already satisfied, skipping upgrade: setuptools in /Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages (from kiwisolver>=1.0.1->matplotlib) (41.2.0)
Installing collected packages: python-dateutil, pyparsing, kiwisolver, cycler, matplotlib
Successfully installed cycler-0.10.0 kiwisolver-1.1.0 matplotlib-3.1.2 pyparsing-2.4.6 python-dateutil-2.8.1
WARNING: You are using pip version 19.2.3, however version 19.3.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
~$ python3
Python 3.8.1 (v3.8.1:1b293b6006, Dec 18 2019, 14:08:53)
[Clang 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import matplotlib.pyplot as plt

nltk_data
~$ python3
Python 3.8.1 (v3.8.1:1b293b6006, Dec 18 2019, 14:08:53)
[Clang 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import nltk
>>> nltk.download()
showing info https://raw.githubusercontent.com/nltk_data/gh-pages/index.xml

Windows 10: setup
• The environment variable PATH should be set to point to the Python 3 install directory
• Type in search:
  • Edit environment variables for your account

Windows 10: install nltk
• On the command line:
  • pip3 install pyyaml nltk
• Package pyyaml must be used somewhere in nltk …
• Source: http://www.pitt.edu/~naraehan/python2/faq.html

Windows 10: install numpy and test nltk
• On the command line:
  • pip3 install numpy
  • (the chunking algorithm uses it)
• Let's test nltk:
  • .word_tokenize() converts a string into words
  • .pos_tag() does part-of-speech tagging
  • .ne_chunk() does named entity recognition
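The three calls named above can be chained as in the sketch below. It assumes nltk, numpy, and the tokenizer, tagger, and chunker data from nltk.download() are already installed; the wrapper function `tag_and_chunk` and the test sentence are made up for this example, and the function simply returns None when something is missing.

```python
def tag_and_chunk(sentence):
    """Tokenize, POS-tag, and NE-chunk a sentence.

    Returns an nltk Tree, or None if nltk or its data is not available.
    """
    try:
        import nltk
        tokens = nltk.word_tokenize(sentence)  # string -> list of words
        tagged = nltk.pos_tag(tokens)          # words -> (word, tag) pairs
        return nltk.ne_chunk(tagged)           # tagged words -> Tree with entity chunks
    except (ImportError, LookupError) as err:
        # ImportError: nltk (or numpy) is not installed.
        # LookupError: a required nltk_data resource has not been downloaded.
        print("NLTK or its data is not available:", err)
        return None

tree = tag_and_chunk("Angela Merkel spoke in Berlin on Wednesday.")
if tree is not None:
    print(tree)
```

When a LookupError is reported, its message normally names the missing resource, which can then be fetched with nltk.download().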


Windows 10: test nltk • . draw() takes a Tree object and draws it in a pop-up window

Windows 10: install nltk data
• Install corpus data (from inside Python) using
  • nltk.download()

Test
• >>> from nltk.corpus import treebank
• >>> t = treebank.parsed_sents()
• >>> len(t)
• 3914
• >>> t[-1].draw()
• There is a sample of the well-known Penn Treebank Wall Street Journal (WSJ) corpus included
  • 3,914 parsed sentences
  • 49,000+ parsed sentences in the full corpus


Should also install Matplotlib • For nltk charts and graphics. • More on this next lecture…