Superficial & Lexical level

  • Slides: 15

Superficial & Lexical level 1
• Superficial level
  • What is a word
• Lexical level
  • Lexicons
  • How to acquire lexical information
NLP superficial and lexical level

Superficial level 1
• Textual pre-processing
  • Getting the document(s)
    • Accessing a database (DB)
    • Accessing the Web (wrappers)
  • Getting the textual fragments of a document
    • Multimedia documents, Web pages, ...
  • Filtering out meta-information
    • HTML tags, XML tags, ...
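The "filtering out meta-information" step can be sketched with Python's standard `html.parser` module. This is only a minimal illustration, not a full wrapper: the names `TextExtractor` and `strip_markup` are hypothetical, and skipping `<script>` and `<style>` content is an assumption about what counts as meta-information.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}  # assumption: these never hold document text

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def strip_markup(html_text):
    """Return the textual fragments of an HTML string, joined by spaces."""
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()
    return " ".join(parser.chunks)
```

Joining fragments with a single space is crude (it also puts a space before punctuation that follows an inline tag), so a real pre-processor would need a more careful whitespace policy.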

Superficial level 2
• Text segmentation into paragraphs or sentences
• Tokenization
  • Orthographic vs. grammatical word
  • Multiword terms
    • dates, formulas, acronyms, abbreviations, quantities (and units), idioms
  • Named entities
    • NER, NEC, NERC
  • Unknown words
• Language identification
References: Beeferman et al., 1999; Ratnaparkhi, 1998; Bikel et al., 1999; Borthwick, 1999; Mikheev et al., 1999; Elworthy, 1999; Adams and Resnik, 1997
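A rule-based tokenizer illustrates why the orthographic word (a string between spaces) is not always the right unit. The sketch below is one possible regular-expression approach, not the method of any of the cited papers; the pattern set and the name `tokenize` are assumptions chosen for illustration, and the patterns cover only a few of the multiword cases listed above.

```python
import re

# Alternatives are tried in order, so more specific patterns come first.
TOKEN_PATTERN = re.compile(
    r"""
    \d{1,2}/\d{1,2}/\d{2,4}        # dates such as 12/05/1999
  | (?:[A-Za-z]\.){2,}             # acronyms with periods such as U.S.A.
  | \d+(?:[.,]\d+)*                # numbers such as 150,000 or 0.4
  | \w+(?:[-']\w+)*                # words, allowing inner hyphens or apostrophes
  | [^\w\s]                        # any other single non-space symbol
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Split text into tokens, keeping dates, acronyms and numbers whole."""
    return TOKEN_PATTERN.findall(text)
```

With these patterns, `tokenize("The U.S.A. signed it on 12/05/1999.")` keeps `U.S.A.` and `12/05/1999` as single tokens while splitting off the final period.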

Superficial level 3
• Vocabulary size (V)
• Heaps' law: $V = K N^{\beta}$
  • K depends on the text, $10 \le K \le 100$
  • N is the total number of words
  • $\beta$ depends on the language; for English, $0.4 \le \beta \le 0.6$
• Vocabulary grows sublinearly but does not saturate
  • growth tends to stabilize at about 1 MB of text (150,000 words)
[Figure: number of different words plotted against total number of words]
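To make the sublinear growth concrete, the formula can be evaluated directly. The function name `heaps_vocabulary` and the parameter values in the usage note are illustrative assumptions; real values of K and the exponent have to be fitted to a specific corpus.

```python
def heaps_vocabulary(n_tokens, k, beta):
    """Estimate vocabulary size V = K * N**beta (Heaps' law)."""
    if n_tokens < 0:
        raise ValueError("n_tokens must be non-negative")
    return k * n_tokens ** beta
```

For example, with the assumed values K = 10 and an exponent of 0.5, a text of 10,000 words gives an estimate of about 1,000 different words, and doubling the text length increases the estimate by a factor of about 1.41 rather than 2.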

Superficial level 4
• Word tokens vs. word types
• Statistical distribution of words in a document
  • Clearly non-uniform
  • The most common words cover more than 50% of occurrences
  • 50% of the words occur only once
  • ~12% of the document consists of words occurring fewer than 4 times
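The quantities on this slide can be measured on any tokenized document. The following sketch assumes the text has already been split into tokens; the function name `distribution_summary` and the dictionary keys are hypothetical.

```python
from collections import Counter


def distribution_summary(tokens):
    """Summarize the token/type distribution of a list of tokens."""
    if not tokens:
        raise ValueError("tokens must not be empty")
    counts = Counter(tokens)
    n_tokens = len(tokens)          # word tokens: every occurrence
    n_types = len(counts)           # word types: distinct words
    hapaxes = sum(1 for c in counts.values() if c == 1)
    rare_mass = sum(c for c in counts.values() if c < 4)
    return {
        "tokens": n_tokens,
        "types": n_types,
        "share_of_types_occurring_once": hapaxes / n_types,
        "share_of_tokens_from_rare_words": rare_mass / n_tokens,
    }
```

On a large enough document, the last two values can be compared with the 50% and ~12% figures quoted above; on a short text they will differ widely.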

Superficial level 5
Zipf's law: sort the words occurring in a document by their frequency. The product of the frequency of a word (f) and its rank (r) is approximately constant.
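A quick way to inspect this relation on a document is to list the product of rank and frequency for each word. The sketch below assumes a pre-tokenized text; `rank_frequency_products` is a hypothetical helper name, and ties in frequency are ranked in order of first appearance.

```python
from collections import Counter


def rank_frequency_products(tokens):
    """Return (rank, word, frequency, rank * frequency), most frequent first."""
    ranked = Counter(tokens).most_common()
    return [
        (rank, word, freq, rank * freq)
        for rank, (word, freq) in enumerate(ranked, start=1)
    ]
```

If the law holds for a document, the last element of each tuple stays roughly in the same range across ranks, although real data usually deviate at the very top and in the long tail.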

Lexical level 1
• Part of Speech (POS)
  • Formal property of a word-type determining its acceptable uses in syntax
  • A POS can be seen as a class of words
  • A word-type can have several POS; a word-token has only one
• Plain categories
  • open: many elements, neologisms, independent and semantically rich classes
  • N, Adj, Adv, V
• Functional categories
  • closed
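The type/token distinction for POS can be shown with a toy lexicon. Everything here is an illustrative assumption: the three entries, the `Det` tag for a functional category, and the helper names are not taken from the slides.

```python
# Toy lexicon: each word-type maps to the set of POS it may take.
LEXICON = {
    "book": {"N", "V"},
    "light": {"N", "V", "Adj"},
    "the": {"Det"},          # assumed tag for a closed, functional category
}

OPEN_CLASSES = {"N", "Adj", "Adv", "V"}


def possible_tags(word_type):
    """All POS a word-type may take according to the toy lexicon."""
    return LEXICON.get(word_type.lower(), set())


def is_ambiguous(word_type):
    """True when the word-type has more than one possible POS."""
    return len(possible_tags(word_type)) > 1


def is_consistent(tagged_tokens):
    """Check that each word-token carries exactly one tag allowed for its type."""
    return all(tag in possible_tags(word) for word, tag in tagged_tokens)
```

In the tagged sequence `[("Book", "V"), ("the", "Det"), ("book", "N")]`, the type "book" is ambiguous, yet each of its two tokens carries a single POS.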

Lexical level 2
Lexicon
• Repository of lexical information for human or computer use
• Two aspects to consider
  • Representation of lexical information
  • Acquisition of lexical information

Lexical level 3
Lexicon content
• Orthographic transcription
• Phonetic transcription
• Inflection model
• Diathesis alternations, subcategorization frames
  • LOVE VTR (OBJLIST: SN).
  • LOVE
    • CAT = VERB
    • SUBCAT = <SN, SN>
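The second LOVE entry is an attribute/value structure, which maps naturally onto a small record type. The sketch below is one possible encoding, not a standard lexicon format: the class `LexicalEntry`, its field names, and the reading of `<SN, SN>` as subject plus one object are assumptions.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class LexicalEntry:
    """One lexicon entry with the kinds of content listed on the slide."""
    lemma: str                               # orthographic form
    cat: str                                 # CAT attribute
    subcat: Tuple[str, ...] = ()             # SUBCAT frame
    phonetic: Optional[str] = None           # phonetic transcription, if known
    inflection_model: Optional[str] = None   # name of an inflection paradigm


# The LOVE entry from the slide: CAT = VERB, SUBCAT = <SN, SN>.
LOVE = LexicalEntry(lemma="love", cat="VERB", subcat=("SN", "SN"))


def takes_one_object(entry):
    """Assumption: the first SUBCAT slot is the subject, the rest are objects."""
    return entry.cat == "VERB" and len(entry.subcat) == 2
```

Leaving the phonetic and inflection fields optional reflects that real lexicons are often only partially filled in.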

Lexical level 4
• POS
• Argument structure
• Semantic information
  • dictionaries => definition
  • lexicons => semantic types predefined in a hierarchy
• Lexical relations
  • derivation
• Equivalence with other languages

Lexical level 5
Problems
• Form
  • attribute/value pairs, binary or n-ary relations, coded values, open-domain values, ...
• Multiple assignments
  • One-to-many and many-to-one relations
  • Contextual dependencies, ...
• Facets of features
  • Mandatory or optional, cardinality, default values
• Grading
  • Exact values, preferences, probabilistic assignments

Lexical level 6
Representation
• General-purpose databases
• Textual databases
• Lexical databases
• OO formalisms
• OO databases
• Frames
• Unification-based formalisms

Lexical level 7
Lexical information acquisition
• Dictionaries
  • MRD (machine-readable dictionaries)
  • Predefined internal structure
  • Some degree of coding in some contents
  • Internal relations (synonymy, hyponymy, ...)
  • (Sometimes) restricted vocabulary
  • Some systematicity in how definitions are built

Lexical level 8
Information present in corpora
• Collocations
• Argument structure
• Frequency information
• Context
• Grammatical induction
• Probabilistic analysis
• Lexical relations
• Examples of use
• Selectional restrictions
• Nominal compounds
• Idioms, ...
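As an example of extracting one of these kinds of information, collocation candidates can be scored from adjacent word pairs. The sketch uses pointwise mutual information (PMI), a common association measure chosen here as an assumption, since the slides do not name a method; `bigram_pmi` is a hypothetical function name.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Score each adjacent word pair by PMI = log2(P(x, y) / (P(x) * P(y)))."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)
    scores = {}
    for (first, second), count in bigrams.items():
        p_pair = count / n_bigrams
        p_first = unigrams[first] / n_unigrams
        p_second = unigrams[second] / n_unigrams
        scores[(first, second)] = math.log2(p_pair / (p_first * p_second))
    return scores
```

PMI is known to overrate pairs of rare words, so in practice it is usually combined with a minimum frequency threshold.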

Lexical level 9
Corpus typology
• Raw corpus
• Horizontal or vertical corpus
• Tagged corpora
• Parenthesized corpora
• Treebanks
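The difference between a horizontal and a vertical tagged corpus is one of layout, and converting between them is straightforward. The slides do not fix a file format, so the sketch assumes `word/TAG` items separated by spaces for the horizontal form and one tab-separated `word`, `TAG` pair per line for the vertical form; the function names and the tags in the usage note are illustrative.

```python
def horizontal_to_vertical(line, sep="/"):
    """Turn 'word/TAG word/TAG ...' into one 'word<TAB>TAG' pair per line."""
    rows = []
    for item in line.split():
        if sep not in item:
            raise ValueError(f"missing tag separator in {item!r}")
        # Split on the last separator so words containing it stay intact.
        word, _, tag = item.rpartition(sep)
        rows.append(f"{word}\t{tag}")
    return "\n".join(rows)


def vertical_to_horizontal(block, sep="/"):
    """Turn 'word<TAB>TAG' lines back into a single 'word/TAG ...' line."""
    pairs = (row.split("\t") for row in block.splitlines() if row.strip())
    return " ".join(f"{word}{sep}{tag}" for word, tag in pairs)
```

For instance, the horizontal line `The/DT cat/NN sleeps/VBZ` becomes three lines in the vertical layout, one per token, and converting back restores the original line.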