Corpus Processing and NLP
What is NLP?
• Natural Language Processing
  – natural language vs. computer languages
• Other names
  – Computational Linguistics
    • emphasizes the scientific, not the technological
  – Language Engineering
    • official European Union term, ca. 1996-99
  – Human Language Technology (HLT)
    • preferred EU and US Government term
  – Language Technology

NLP and linguistics
• Linguistics (LING): supplies ideas, interprets results
• NLP: tests theories, exposes gaps, plus turns them into technology

Example: regular morphology
LINGUISTICS:
  – Rules: stems -> inflected forms
NLP:
  – program the rules
  – apply rules to a lexicon of stems
  – Is the output correct? Errors?
LINGUISTICS:
  – refine theory
Needed for: web search, spell-checkers, machine translation, speech recognition systems, etc.

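The loop described above (program the rules, apply them to a lexicon, inspect the errors) can be sketched roughly as follows. The three rules and the tiny lexicon are illustrative assumptions, not rules taken from the slides; the point is that a wrong output such as "stoping" exposes a gap in the rule set.

```python
def inflect(stem):
    """Apply toy regular-verb rules to one stem (hypothetical rule set)."""
    if stem.endswith("e"):
        # Stems ending in "e": add "s"/"d", drop the "e" before "ing".
        return {"3sg": stem + "s", "past": stem + "d", "ing": stem[:-1] + "ing"}
    # Default rule: plain suffixation.
    return {"3sg": stem + "s", "past": stem + "ed", "ing": stem + "ing"}

# Hypothetical lexicon of stems.
lexicon = ["help", "arrive", "stop"]

for stem in lexicon:
    print(stem, inflect(stem))
```

The forms for "help" and "arrive" come out right, while "stop" yields "stoped" and "stoping": the rules say nothing about consonant doubling, which is the kind of error that sends the linguist back to refine the theory.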
Application areas
• web search
  – Basic search
  – Filtering results
• spelling and grammar checking
• machine translation (MT)
• talking to computers
  – speech processing as well
• information extraction (IE)
  – finding facts in a database of documents; populating a database, answering questions

How can NLP make better dictionaries?
By pre-processing a corpus:
• tokenization
• sentence splitting
• lemmatization
• POS-tagging
• parsing
Each step builds on its predecessors

Tokenization
“identifying the words”
from: He didn't arrive.
to: He did n't arrive .

Automatic tokenization
• Western writing systems
  – easy! space is separator
• Chinese, Japanese, some other writing systems
  – do not use word-separator
  – hard
    • like POS-tagging (below)

Why isn't space=separator enough (even for English)?
• What is a space?
  – linebreaks, paragraph breaks, tabs
• Punctuation
  – characters do not form parts of words but may be attached to words (with no spaces)
    • brackets, quotation marks
• Hyphenation
  – is co-op one word or two? is well-managed?

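A minimal sketch of a tokenizer that goes beyond splitting on spaces, using Python's standard `re` module. The pattern is an illustrative assumption: it splits off "n't", detaches punctuation, treats any whitespace (tabs, linebreaks) as a separator, and keeps hyphenated words whole, which is only one possible answer to the co-op question.

```python
import re

# Alternatives are tried left to right at each position:
#   1. the clitic "n't" (straight or curly apostrophe)
#   2. a word stem immediately followed by "n't" (so "didn't" -> "did" + "n't")
#   3. a word, optionally with internal hyphens (kept as one token here)
#   4. any single character that is neither a word character nor whitespace
TOKEN = re.compile(
    r"n['’]t\b"
    r"|\w+(?=n['’]t\b)"
    r"|\w+(?:-\w+)*"
    r"|[^\w\s]"
)

def tokenize(text):
    """Return the list of tokens found in text."""
    return TOKEN.findall(text)

print(tokenize("He didn't arrive."))
print(tokenize("(co-op)\twell-managed"))
```

Under these assumptions the first call gives the tokens He, did, n't, arrive and a separate full stop; the second shows brackets being detached and a tab acting as a separator.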
Sentence splitting
“identifying the sentences”
from: He didn't arrive.
to: He did n't arrive .
to: <s> He did n't arrive . </s>

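As a rough illustration, a naive sentence splitter can be written with one regular expression. The boundary rule used here (sentence-final punctuation, then whitespace, then an upper-case letter) is an assumption made for this sketch, and it is known to fail on abbreviations such as "Dr.".

```python
import re

# Boundary: after ".", "!" or "?", across whitespace, before an upper-case letter.
BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

def split_sentences(text):
    """Split text into sentences at the naive boundary defined above."""
    return [part for part in BOUNDARY.split(text.strip()) if part]

def mark_up(text):
    """Wrap each sentence in <s> ... </s>, as on the slide."""
    return ["<s> " + sentence + " </s>" for sentence in split_sentences(text)]

for line in mark_up("He didn't arrive. She left early."):
    print(line)
```

A production splitter would run after tokenization and consult a list of abbreviations; this sketch only shows the basic idea.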
Lemmatization
Mapping from text-word to lemma: help (verb)
text-word   lemma
help        help (v)
helps       help (v)
helping     help (v)
helped      help (v)

Lemmatization
Mapping from text-word to lemma: help (verb), help (noun), helping (noun)
text-word   lemma
help        help (v), help (n)
helps       help (v), help (n)**
helping     help (v), helping (n)
helped      help (v)
helpings    helping (n)
**help (n): usually a mass noun, but part of the compound "home help", which is a count noun and takes the "s" ending.

Lemmatization
Dictionary entries are for lemmas, so lemmatization is required for a match between text-word and dictionary-word.

Lemmatization
• Searching by lemma
  – English: little inflection
  – French: 36 forms per verb
  – Finno-Ugric: 2000
• Not always wanted:
  – English royalty
    • singular: kings and queens
    • plural royalties: payments to authors

Automatic lemmatization
• Write rules:
  – if word ends in "ing", delete "ing";
  – if the remainder is a verb lemma, add it to the list of possible lemmas
• If a detailed grammar is available, use it
• A full lemma list is also required
  – Often available from dictionary companies

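The "ing" rule above can be sketched as follows. The lemma list is a tiny hypothetical stand-in for a full list, and the extra step of restoring a dropped final "e" is an assumption added for illustration, not something the slide specifies.

```python
# Hypothetical stand-in for a full list of verb lemmas.
VERB_LEMMAS = {"help", "arrive", "visit", "sing"}

def possible_lemmas(word):
    """Return the candidate verb lemmas for a text-word."""
    word = word.lower()
    candidates = []
    if word in VERB_LEMMAS:
        # The word is itself a lemma (base form).
        candidates.append(word)
    if word.endswith("ing"):
        remainder = word[:-3]
        # Try the bare remainder, then the remainder with a restored "e"
        # ("arriving" -> "arriv" -> "arrive"); the second try is an added assumption.
        for stem in (remainder, remainder + "e"):
            if stem in VERB_LEMMAS:
                candidates.append(stem)
    return candidates

for w in ["helping", "arriving", "sing", "king"]:
    print(w, possible_lemmas(w))
```

The lemma list is what stops "king" from being reduced to "k" and keeps "sing" as its own lemma, which is why the slide says a full lemma list is required alongside the rules.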
Part-of-speech (POS) tagging
“identifying parts of speech”
from: He didn't arrive.
to: …
to:
<s>
He       PNP    pers pronoun
did      VVD    past tense verb
n't      XNOT   not
arrive   VV     base form of verb
.        C      punctuation
</s>

Tagsets
• The set of part-of-speech tags to choose between
  – Basic: noun, verb, pronoun …
  – Advanced: examples from the CLAWS English tagset
    • NN2: plural noun
    • VVG: -ing form of lexical verb
• Based on linguistics of the language.

POS-tagging: why?
• Use grammar when searching
  – Nouns modified by "buckle"
  – Verbs that "buckle" is object of

POS-tagging: how?
• Big topic for computational linguistics
  – well understood
  – taggers available for major languages
• Some taggers use lemmatized input, others do not
• Methods
  – Constraint-based: set of rules of the form
    if previous word is "the" and VERB is one of the possibilities, delete VERB
  – Statistical:
    • Machine learning from tagged corpus
    • Various methods
• Ref: Manning and Schütze, Foundations of Statistical Natural Language Processing, MIT Press, 1999.

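The constraint rule quoted above can be sketched like this. The tag names (DET, NOUN, VERB) and the data layout, one set of candidate tags per word, are assumptions for illustration; the guard that never removes a word's last remaining tag is also an addition, not part of the rule as stated.

```python
def apply_the_rule(words, candidates):
    """Apply one constraint: after "the", a word cannot be a VERB.

    words      -- list of tokens
    candidates -- list of sets of possible tags, one set per token
    Returns new sets; the input is left unchanged.
    """
    result = [set(tags) for tags in candidates]
    for i in range(1, len(words)):
        after_the = words[i - 1].lower() == "the"
        # Assumption: keep VERB if it is the only remaining possibility.
        if after_the and "VERB" in result[i] and len(result[i]) > 1:
            result[i].discard("VERB")
    return result

words = ["the", "buckle", "broke"]
candidates = [{"DET"}, {"NOUN", "VERB"}, {"VERB"}]
print(apply_the_rule(words, candidates))
```

Here "buckle" starts out ambiguous between noun and verb and is left with NOUN only. A real constraint-based tagger applies many such rules in sequence until the remaining ambiguity is as small as possible.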
Parsing
• Find the structure:
  – Phrase structure (trees): The cat sat on the mat
  – Dependency structure (links): The cat sat on the mat

Automatic parsing
• Big topic
  – see Jurafsky and Martin or other NLP textbook
• Many methods too slow for large corpora
• Sketch Engine usually uses “shallow parsing”
  – Patterns of POS-tags
  – Regular expressions

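To give a flavour of shallow parsing with patterns of POS-tags, here is a minimal sketch that finds simple noun phrases with a regular expression over a "word/TAG" string. The tag names (DET, ADJ, NOUN) and the pattern are assumptions for illustration; this is not Sketch Engine's own grammar formalism.

```python
import re

# Noun phrase: optional determiner, any number of adjectives, one or more nouns.
# (?<!\S) anchors the match at the start of a token.
NP = re.compile(r"(?<!\S)(?:\S+/DET )?(?:\S+/ADJ )*(?:\S+/NOUN )+")

def noun_chunks(tagged):
    """tagged: list of (word, tag) pairs. Returns a list of chunks (lists of words)."""
    # Encode the sentence as "word/TAG word/TAG ... " with a trailing space.
    text = "".join(f"{word}/{tag} " for word, tag in tagged)
    chunks = []
    for match in NP.finditer(text):
        # Strip the tag from each token of the matched span.
        chunks.append([tok.rsplit("/", 1)[0] for tok in match.group().split()])
    return chunks

sentence = [("The", "DET"), ("cat", "NOUN"), ("sat", "VERB"),
            ("on", "PREP"), ("the", "DET"), ("mat", "NOUN")]
print(noun_chunks(sentence))
```

Because the pattern only looks at local tag sequences, it runs in a single pass over the text, which is what makes this kind of approach practical for large corpora.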
Regular expressions
• Search for any pattern
• Very useful in lots of places
• Exercises
  – http://www.sketchengine.co.uk/exercises/regex
Madrid 2010, Kilgarriff: Corpus Processing and NLP

Summary
• What is NLP?
• How can it help?
  – Tokenizing
  – Sentence splitting
  – Lemmatizing
  – POS-tagging
  – Parsing

Exercise
• A sentence of your language
• A tagset of your language
• Tokenize
• For each word, decide
  – What is the lemma (doesn't apply in Chinese)
  – Which tag applies
Word        Lemma      Tag
Visiting    visit      VVG
relatives   relative   NN2
…