Corpus Processing and NLP

What is NLP?
• Natural Language Processing
  – natural language vs. computer languages
• Other names
  – Computational Linguistics
    • emphasizes scientific not technological
  – Language Engineering
    • official European Union term, ca. 1996-99
  – Human Language Technology (HLT)
    • preferred EU and US Government term
  – Language Technology

NLP and linguistics
• LING: supply ideas, interpret results
• NLP: test theories, expose gaps
• NLP, plus: turn into technology

Example: regular morphology
LINGUISTICS:
  – Rules: stems -> inflected forms
NLP:
  – program the rules
  – apply rules to a lexicon of stems
  – Is the output correct? Errors?
LINGUISTICS:
  – refine theory
Needed for: web search, spell-checkers, machine translation, speech recognition systems etc.
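
A minimal sketch of this loop in Python, assuming two simplified rules for regular English verbs. The function name `inflect`, the rule set and the three-word lexicon are illustrative assumptions, not part of the original slides.

```python
def inflect(stem):
    """Apply simplified regular English verb rules to a stem (illustrative only)."""
    if stem.endswith("e"):
        # arrive -> arrives, arrived, arriving
        return {"3sg": stem + "s", "past": stem + "d", "ing": stem[:-1] + "ing"}
    # help -> helps, helped, helping
    return {"3sg": stem + "s", "past": stem + "ed", "ing": stem + "ing"}


# Hypothetical lexicon of stems; "go" is irregular on purpose.
LEXICON = ["help", "arrive", "go"]

for stem in LEXICON:
    print(stem, inflect(stem))
```

The rules give correct forms for help and arrive, but also "gos" and "goed" for go. Errors like these are what send the linguist back to refine the theory.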

Application areas
• web search
  – Basic search
  – Filtering results
• spelling and grammar checking
• machine translation (MT)
• talking to computers
  – speech processing as well
• information extraction (IE)
  – finding facts in a database of documents; populating a database, answering questions

How can NLP make better dictionaries?
By pre-processing a corpus:
• tokenization
• sentence splitting
• lemmatization
• POS-tagging
• parsing
Each step builds on predecessors

Tokenization
“identifying the words”
from: He didn't arrive.
to:   He did n’t arrive .

Automatic tokenization
• Western writing systems
  – easy! space is separator
• Chinese, Japanese, some other writing systems
  – do not use word-separator
  – hard
    • like POS-tagging (below)

Why isn't space=separator enough (even for English)?
• what is a space
  – linebreaks, paragraph breaks, tabs
• Punctuation
  – characters do not form parts of words but may be attached to words (with no spaces)
    • brackets, quotation marks
• Hyphenation
  – is co-op one word or two? is well-managed?
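
A rough sketch of a tokenizer that addresses these points, assuming Python's standard `re` module. The pattern and the function name `tokenize` are illustrative assumptions. The choice to keep hyphenated words whole and to split off "n't" is one possible policy, not the only one.

```python
import re

# Hypothetical token pattern, tried left to right at each position.
TOKEN_RE = re.compile(r"""
    \w+(?=n['’]t\b)     # the part of a word before the clitic, e.g. did + n't
  | n['’]t\b            # the clitic itself
  | \w+(?:-\w+)*        # ordinary words; hyphenated forms are kept whole
  | [^\w\s]             # any single punctuation character
""", re.VERBOSE)


def tokenize(text):
    """Return the list of tokens; all whitespace (spaces, tabs, linebreaks) is skipped."""
    return TOKEN_RE.findall(text)


print("He didn't arrive.".split())    # whitespace only: punctuation stays attached
print(tokenize("He didn't arrive."))  # clitic and full stop become separate tokens
```

Splitting on whitespace alone leaves "arrive." as one token, while the pattern separates the full stop and the clitic. Whether co-op or well-managed should be one token or two remains a decision for the corpus builder.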

Sentence splitting
“identifying the sentences”
from: He didn't arrive.
to:   He did n’t arrive .
to:   <s> He did n’t arrive . </s>
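
A naive sketch of sentence splitting over an already tokenized text, assuming that a sentence ends at every ".", "?" or "!" token. The names `split_sentences` and `mark_up` are illustrative assumptions. Real splitters also need to handle abbreviations such as "etc.", which this sketch would split wrongly.

```python
SENTENCE_FINAL = {".", "?", "!"}  # assumed set of sentence-ending tokens


def split_sentences(tokens):
    """Group a flat token list into sentences, closing one at each final punctuation token."""
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in SENTENCE_FINAL:
            sentences.append(current)
            current = []
    if current:  # trailing tokens with no final punctuation
        sentences.append(current)
    return sentences


def mark_up(sentences):
    """Wrap each sentence in <s> ... </s>, one sentence per line."""
    return "\n".join("<s> " + " ".join(s) + " </s>" for s in sentences)


tokens = ["He", "did", "n't", "arrive", ".", "Why", "?"]
print(mark_up(split_sentences(tokens)))
```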

Lemmatization
Mapping from text-word to lemma
help (verb)

text-word   lemma
help        help (v)
helps       help (v)
helping     help (v)
helped      help (v)

Lemmatization
Mapping from text-word to lemma
help (verb), help (noun), helping (noun)

text-word   lemma
help        help (v), help (n)
helps       help (v), help (n)**
helping     help (v), helping (n)
helped      help (v)
helpings    helping (n)

**help (n): usually a mass noun, but part of compound home help which is a count noun, taking the "s" ending.

Lemmatization
Dictionary entries are for lemmas, so lemmatization is required for a match between text-word and dictionary-word.

Lemmatization
• Searching by lemma
  – English: little inflection
  – French: 36 forms per verb
  – Finno-Ugric: 2000.
• Not always wanted:
  – English royalty
    • singular: kings and queens
    • plural royalties: payments to authors

Automatic lemmatization
• Write rules:
  – if word ends in "ing", delete "ing";
  – if the remainder is verb lemma, add to list of possible lemmas
• If detailed grammar available, use it
• full lemma list is also required
  – Often available from dictionary companies
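
A small sketch of rule-based lemmatization along these lines. The lemma lists, the rule table and the function name `lemmatize` are illustrative assumptions. A real system would load the full lemma list from a dictionary and use a much larger rule set.

```python
# Hypothetical lemma lists; in practice these come from a dictionary.
VERB_LEMMAS = {"help", "arrive"}
NOUN_LEMMAS = {"help", "helping"}

# Each rule: (suffix to delete, string to add back, part of speech to check).
RULES = [
    ("", "", "v"), ("", "", "n"),      # the word may already be a lemma
    ("s", "", "v"), ("s", "", "n"),    # helps -> help, helpings -> helping
    ("ing", "", "v"),                  # helping -> help
    ("ing", "e", "v"),                 # arriving -> arrive
    ("ed", "", "v"),                   # helped -> help
    ("d", "", "v"),                    # arrived -> arrive
]


def lemmatize(word):
    """Return every (lemma, pos) pair that some rule licenses for the word."""
    word = word.lower()
    candidates = []
    for suffix, add_back, pos in RULES:
        if not word.endswith(suffix):
            continue
        remainder = word[:len(word) - len(suffix)] + add_back
        lemma_list = VERB_LEMMAS if pos == "v" else NOUN_LEMMAS
        # Keep the remainder only if it is a known lemma of that part of speech.
        if remainder in lemma_list and (remainder, pos) not in candidates:
            candidates.append((remainder, pos))
    return candidates


for w in ["help", "helps", "helping", "helped", "helpings"]:
    print(w, lemmatize(w))
```

The rules return a list of possible lemmas, not a single answer: helping comes back as both help (v) and helping (n). Choosing between them needs context, which is where POS-tagging comes in.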

Part-of-speech (POS) tagging
“identifying parts of speech”
from: He didn't arrive.
to:   … .
to:   <s> He did n’t arrive . </s>

token    tag    meaning
He       PNP    pers pronoun
did      VVD    past tense verb
n’t      XNOT   not
arrive   VV     base form of verb
.        C      punctuation

Tagsets
• The set of part-of-speech tags to choose between
  – Basic: noun, verb, pronoun …
  – Advanced: examples - CLAWS English tagset
    • NN2: plural noun
    • VVG: -ing form of lexical verb
• Based on linguistics of the language.

POS-tagging: why?
• Use grammar when searching
  – Nouns modified by buckle
  – Verbs that buckle is object of

POS-tagging: how?
• Big topic for computational linguistics
  – well understood
  – taggers available for major languages
• Some taggers use lemmatized input, others do not
• Methods
  – constraint-based: set of rules of the form
    if previous word is "the" and VERB is one of the possibilities, delete VERB
  – Statistical:
    • Machine learning from tagged corpus
    • Various methods
    • Ref: Manning and Schütze, Foundations of Statistical Natural Language Processing, MIT Press 1999.
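
A sketch of how the single constraint quoted above could be applied, assuming each word starts with the set of all tags it could take. The tag names, the example tag sets and the function name `apply_the_constraint` are illustrative assumptions. The check that at least one tag survives is an added safeguard, not part of the rule as stated.

```python
def apply_the_constraint(words, possible_tags):
    """Delete VERB from a word's possible tags when the previous word is "the"."""
    result = [set(tags) for tags in possible_tags]  # copy, leave the input untouched
    for i in range(1, len(words)):
        follows_the = words[i - 1].lower() == "the"
        # Added safeguard: never delete the last remaining tag.
        if follows_the and "VERB" in result[i] and len(result[i]) > 1:
            result[i].discard("VERB")
    return result


words = ["the", "help", "arrived"]
# Hypothetical dictionary lookup: "help" is ambiguous between NOUN and VERB.
possible = [{"DET"}, {"NOUN", "VERB"}, {"VERB"}]
print(apply_the_constraint(words, possible))
```

After the rule fires, help keeps only NOUN. A constraint-based tagger applies many such rules until as few tags as possible remain per word.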

Parsing
• Find the structure:
  – Phrase structure (trees)
    The cat sat on the mat
  – Dependency structure (links)
    The cat sat on the mat

Automatic parsing
• Big topic
  – see Jurafsky and Martin or other NLP textbook
• Many methods too slow for large corpora
• Sketch Engine usually uses “shallow parsing”
  – Patterns of POS-tags
  – Regular expressions
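
A generic sketch of shallow parsing as a regular expression over POS-tags, using Python's `re` module. This is not Sketch Engine's own implementation. The one-letter tag codes, the noun-phrase pattern and the function name `noun_chunks` are illustrative assumptions.

```python
import re

# Hypothetical one-letter codes so that a tag sequence becomes a plain string.
TAG_CODES = {"DET": "D", "ADJ": "A", "NOUN": "N", "VERB": "V", "PREP": "P"}

# Assumed noun-phrase pattern: optional determiner, any adjectives, one or more nouns.
NP_PATTERN = re.compile(r"D?A*N+")


def noun_chunks(tagged):
    """Return the word sequences whose tags match the noun-phrase pattern."""
    codes = "".join(TAG_CODES.get(tag, "x") for _, tag in tagged)  # "x" = unknown tag
    return [
        [word for word, _ in tagged[m.start():m.end()]]
        for m in NP_PATTERN.finditer(codes)
    ]


tagged = [("The", "DET"), ("cat", "NOUN"), ("sat", "VERB"),
          ("on", "PREP"), ("the", "DET"), ("mat", "NOUN")]
print(noun_chunks(tagged))
```

One pass over the tag string finds "The cat" and "the mat" without building a full tree, which is why this kind of pattern matching scales to large corpora.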

Regular expressions
• Search for any pattern
• Very useful in lots of places
• Exercises
  – http://www.sketchengine.co.uk/exercises/regex
Madrid 2010 Kilgarriff: Corpus Processing and NLP

Summary
• What is NLP?
• How can it help?
  – Tokenizing
  – Sentence splitting
  – Lemmatizing
  – POS-tagging
  – Parsing

Exercise
• A sentence of your language
• A tagset of your language
• Tokenize
• For each word, decide
  – What is the lemma (doesn’t apply in Chinese)
  – Which tag applies

Word        Lemma      Tag
Visiting    visit      VVG
relatives   relative   NN2
…