NATURAL LANGUAGE PROCESSING AND INFORMATION RETRIEVAL
Dr. Eleni Galiotou, Assistant Professor, Department of Informatics, TEI of Athens
Tempus Seminar, Feb 24, 2004

(Computer-based) Information Retrieval
• Locate (electronically available) documents satisfying the user's information needs
• Information need: a statement in a query language, matched against document surrogates (title, abstract, keywords, etc.)
• Outcome of the IR process: articles, memos, reports, books, annotated image and sound files

The IR strategy
• Purpose:
  – Retrieve all relevant documents
  – Retrieve as few non-relevant documents as possible
• Techniques in classical IR:
  – Empirical and ad hoc
  – Quantitative methods
• IR is also a Natural Language Processing problem
• Heterogeneous collections of full-text documents
  – Need for content understanding => NLP techniques
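
The two goals stated under "Purpose" correspond to the usual measures of recall (share of relevant documents that were retrieved) and precision (share of retrieved documents that are relevant). A minimal sketch of both, assuming documents are identified by plain string IDs; the IDs and the function name are illustrative only:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: IDs returned by the system
    relevant:  IDs judged relevant for the query
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query result: 4 documents retrieved, 3 documents relevant
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d7"]
precision, recall = precision_recall(retrieved, relevant)
```

Here two of the four retrieved documents are relevant, and two of the three relevant documents were found, so retrieving more documents can raise recall while lowering precision.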

Main areas of research in IR
• Content analysis
• Relationships between documents to improve efficiency and effectiveness of IR strategies
• Measurement of effectiveness of retrieval

Example: The vector-space model (1)
• The SMART Text Retrieval System
• Documents and queries represented as vectors in T-dimensional space (T: number of distinct terms in the document collection)
• Automated indexing: assigning terms to a piece of text
• Terms are weighted to reflect their relative importance in the text

The vector-space model (2)
• Result of a query: ranked list of documents ordered by similarity to the query
• Similarity measure: cosine of the angle formed by the query vector and the document vector (cosine correlation)

\[
\mathrm{sim} = \cos(Q, D) = \frac{\sum_{i=1}^{T} q_i d_i}{\left( \sum_{i=1}^{T} q_i^2 \, \sum_{i=1}^{T} d_i^2 \right)^{1/2}}
\]
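
The cosine measure and the resulting ranking can be sketched in a few lines. This is an illustration only, not the SMART implementation; the three-term vocabulary and the document vectors are made up:

```python
import math


def cosine_similarity(q, d):
    """Cosine of the angle between query vector q and document vector d."""
    if len(q) != len(d):
        raise ValueError("vectors must have the same dimension T")
    dot = sum(qi * di for qi, di in zip(q, d))
    norm = math.sqrt(sum(qi * qi for qi in q) * sum(di * di for di in d))
    return dot / norm if norm else 0.0


# Hypothetical collection with T = 3 distinct terms
query = [1.0, 0.0, 1.0]
docs = {
    "d1": [1.0, 0.0, 1.0],
    "d2": [0.0, 1.0, 0.0],
    "d3": [1.0, 1.0, 0.0],
}

# Ranked list of documents, most similar first
ranking = sorted(docs, key=lambda name: cosine_similarity(query, docs[name]), reverse=True)
```

With these vectors, d1 points in the same direction as the query, d3 shares one term with it, and d2 shares none, which fixes the order of the ranked list.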

Extended vector-space model
• Vector: collection of subvectors used to represent different aspects of the documents in the collection
• Overall similarity between two extended vectors:

\[
\mathrm{sim}(Q, D) = \sum_{i} \alpha_i \, \mathrm{sim}_i(Q_i, D_i)
\]

where the sum runs over the subvectors i and α_i = importance of subvector i in the overall similarity between texts
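
A minimal sketch of the weighted combination, assuming that each per-subvector similarity is a cosine and that the subvectors are called "words" and "phrases"; both choices, and the weights, are assumptions made for illustration:

```python
import math


def cosine(q, d):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm = math.sqrt(sum(qi * qi for qi in q) * sum(di * di for di in d))
    return dot / norm if norm else 0.0


def extended_similarity(query, doc, alphas):
    """Weighted sum of per-subvector similarities.

    query, doc: dicts mapping a subvector name to its vector
    alphas:     dict mapping a subvector name to its importance
    """
    return sum(alpha * cosine(query[name], doc[name]) for name, alpha in alphas.items())


# Hypothetical subvectors and weights
alphas = {"words": 0.7, "phrases": 0.3}
query = {"words": [1.0, 0.0], "phrases": [1.0, 1.0]}
doc = {"words": [1.0, 0.0], "phrases": [1.0, 0.0]}
score = extended_similarity(query, doc, alphas)
```

Raising the weight of one subvector makes agreement on that aspect of the documents count more in the overall score.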

Indexing
• Index language: used to describe documents and requests
• Pre-coordinate index terms: a logical combination of any index terms used as a label to identify a class of documents
• Post-coordinate terms: a combination of classes of documents labeled with the individual index terms

LMI vs. NLI
• Non-Linguistic Indexing (NLI): removing stopwords, applying statistical criteria
• Linguistically Motivated Indexing (LMI):
  – Applying syntactic and/or semantic techniques for term identification and description formation
  – Identifying multi-word units and characterizing their internal structure
• NLP needed for automated indexing?!

Index Term Weighting
• tf(t): within-document frequency of term t
• idf(t): inverse document frequency = log(N/n)
  – N: total number of documents in the collection
  – n: number of documents containing term t
• General weighting scheme: w(t) = tf(t) × idf(t)
• Assumptions of term independence are often false
• The situation is worse when single-word terms are intermixed with phrasal terms
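
The weighting scheme can be sketched directly from the definitions above. The slide does not state the base of the logarithm, so the natural logarithm is assumed here; the four tokenized documents are invented for the example:

```python
import math
from collections import Counter


def tf_idf_weights(docs):
    """Return one {term: tf(t) * idf(t)} dict per document.

    docs: list of token lists, one per document
    """
    n_docs = len(docs)
    # n: number of documents containing each term
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))
    weights = []
    for tokens in docs:
        tf = Counter(tokens)  # within-document frequency
        weights.append({t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf})
    return weights


# Hypothetical collection (N = 4)
docs = [
    ["joint", "venture", "bank"],
    ["joint", "venture", "oil"],
    ["joint", "report"],
    ["oil", "price", "oil"],
]
weights = tf_idf_weights(docs)
```

In the first document, "joint" occurs in three of the four documents and so receives a much lower weight than "bank", which occurs in only one.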

NLP Based Indexing
• Example: TREC experiments
  – “joint venture” is important in the Wall Street Journal database
  – “joint” and “venture” are dropped from the list of terms by the system because their idf is too low
• Identify groups of words creating meaningful phrases
• Simple collocations, statistically validated N-grams, Part-Of-Speech tagged sequences, syntactic structures, semantic concepts
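
The “joint venture” effect can be reproduced on a toy scale: two words that are each too common to be useful can still form a phrase that is rare enough to discriminate. The sketch below treats every adjacent word pair as a candidate phrase, which is a simplifying assumption, and the six documents are invented, not TREC data:

```python
import math


def idf(term_sets):
    """idf(t) = log(N / n) over a list of per-document term sets."""
    n_docs = len(term_sets)
    doc_freq = {}
    for terms in term_sets:
        for term in terms:
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / n) for term, n in doc_freq.items()}


def bigrams(tokens):
    """Adjacent word pairs, used here as candidate phrases."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


# Hypothetical collection
docs = [
    "joint venture announced".split(),
    "joint statement issued".split(),
    "venture capital raised".split(),
    "joint venture dissolved".split(),
    "new venture with joint funding".split(),
    "quarterly earnings reported".split(),
]

word_idf = idf([set(d) for d in docs])
phrase_idf = idf([set(bigrams(d)) for d in docs])
```

Both single words occur in four of the six documents, while the phrase occurs in only two, so the phrase receives a clearly higher idf than either of its parts.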

Obstacles in the application of NLP techniques in IR
• Lack of robustness and efficiency
• Representations produced: complex structures that are hard to compare effectively when determining relevance
• Solution: use NLP to assist the IR system (Boolean, statistical, probabilistic) in representing documents for search purposes
  – Off-line database indexing

Stream-based IR Model (1)
• Combination of statistical and NLP techniques
• Term extraction steps
  1. Elimination of stopwords (no-content or low-content words: determiners, prepositions, pronouns, very frequent words)
  2. Morphological stemming (affix-stripping process or morphological analysis)

Stream-based IR Model (2)
  4. Phrase extraction: shallow text processing techniques (POS tagging, phrase boundary detection, word co-occurrence metrics) used to identify relatively stable groups of words
  5. Phrase normalization: “Head+Modifier” pairs used to normalize across syntactic variants and reduce them to a common “concept”, e.g. “weapon proliferation”, “proliferation of weapons” => weapon+proliferate
  6. Proper name extraction: people's names and titles, location names and organization names used for indexing
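
Steps 1, 2 and 5 can be illustrated together on the slide's own example. This is a toy sketch under strong assumptions: the stopword list is a tiny made-up one, a hand-written lookup table stands in for real morphological analysis, and only two patterns are handled ("MODIFIER HEAD" and "HEAD of MODIFIER"):

```python
# Hypothetical stopword list (step 1)
STOPWORDS = {"the", "of", "a", "an", "in", "and"}

# Hypothetical mini-lexicon standing in for morphological analysis (step 2)
LEMMAS = {"proliferation": "proliferate"}


def normalize_word(word):
    """Map a word to a normalized form: lexicon lookup, else strip a plural 's'."""
    word = word.lower()
    if word in LEMMAS:
        return LEMMAS[word]
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def head_modifier(phrase):
    """Reduce a phrase with two content words to 'modifier+head' (step 5)."""
    tokens = phrase.lower().split()
    content = [normalize_word(t) for t in tokens if t not in STOPWORDS]
    if len(content) != 2:
        raise ValueError("this sketch handles exactly two content words")
    if "of" in tokens:
        head, modifier = content  # pattern: HEAD of MODIFIER
    else:
        modifier, head = content  # pattern: MODIFIER HEAD
    return f"{modifier}+{head}"


variants = ["weapon proliferation", "proliferation of weapons", "the proliferation of weapons"]
normalized = {head_modifier(v) for v in variants}
```

All three surface variants collapse to the single index term weapon+proliferate, which is the point of the normalization step.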

Stream-based IR Model (3)
• Final results: merge the ranked lists of documents obtained from searching all streams with appropriately preprocessed queries
• Contributions from each stream are weighted using an effective combination of alternative retrieval and routing methods
• Meta-search strategy which maximizes the contributions of each stream (base search engines: SMART v. 11, PRISE v. 2, etc.)
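
The slides do not give the actual merging formula, so the following sketch assumes the simplest possible scheme: each stream returns (document, score) pairs, and the final score is a weighted sum over streams. The stream names, scores and weights are hypothetical:

```python
from collections import defaultdict


def merge_streams(stream_results, stream_weights):
    """Merge per-stream ranked lists into one list, best document first.

    stream_results: {stream name: [(doc_id, score), ...]}
    stream_weights: {stream name: weight of that stream}
    """
    combined = defaultdict(float)
    for stream, ranked in stream_results.items():
        weight = stream_weights[stream]
        for doc_id, score in ranked:
            combined[doc_id] += weight * score
    # Sort by descending score, breaking ties by document ID
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical output of three streams for one query
stream_results = {
    "stems": [("d1", 0.9), ("d2", 0.6), ("d3", 0.3)],
    "phrases": [("d2", 0.8), ("d4", 0.4)],
    "names": [("d3", 0.7), ("d2", 0.2)],
}
stream_weights = {"stems": 1.0, "phrases": 2.0, "names": 1.0}
merged = merge_streams(stream_results, stream_weights)
merged_ids = [doc_id for doc_id, _ in merged]
```

A document found by several streams (d2 here) accumulates evidence from each of them and moves to the top, even though no single stream ranked it first by a wide margin.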

Advantages of Stream Architecture
• Easier to compare the contributions of different indexing features or representations
• Convenient testbed for experimenting with algorithms designed to merge results obtained using different IR engines and/or techniques
• Easier to fine-tune the system in order to obtain optimum performance
• Allows the use of IR engines without having to adapt them

Part Of Speech Tagging (1)
• Allows resolution of lexical ambiguities in running text, assuming a known general type of text and a context in which a word is used
  => more accurate lexical normalization and phrase boundary detection
• Assigns POS label(s) to each word in a text depending on the labels assigned to preceding words
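
The idea that a word's label depends on the labels of preceding words can be shown with a greedy toy tagger. It is not the tagger used in the experiments; the lexicon, the tag set and the preference scores below are all invented for the illustration:

```python
# Hypothetical lexicon: possible tags for each word
LEXICON = {
    "the": ["DT"],
    "they": ["PRP"],
    "can": ["MD", "NN", "VB"],
    "fish": ["VB", "NN"],
    "rusts": ["VBZ", "NNS"],
}

# Hypothetical preference of a tag given the previous tag ("<s>" = sentence start)
TRANSITIONS = {
    ("<s>", "DT"): 0.5, ("<s>", "PRP"): 0.5,
    ("DT", "NN"): 0.9, ("DT", "MD"): 0.01, ("DT", "VB"): 0.01,
    ("PRP", "MD"): 0.5, ("PRP", "VB"): 0.3, ("PRP", "NN"): 0.01,
    ("MD", "VB"): 0.9, ("MD", "NN"): 0.05,
    ("NN", "VBZ"): 0.6, ("NN", "NNS"): 0.2,
}


def tag(words):
    """Assign one tag per word, choosing the tag preferred after the previous tag."""
    tags = []
    previous = "<s>"
    for word in words:
        candidates = LEXICON[word.lower()]
        best = max(candidates, key=lambda t: TRANSITIONS.get((previous, t), 0.0))
        tags.append(best)
        previous = best
    return tags


modal_reading = tag("they can fish".split())
noun_reading = tag("the can rusts".split())
```

The ambiguous word "can" is tagged as a modal after a pronoun and as a noun after a determiner, which is the kind of lexical ambiguity the slide refers to.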

POS Tagging (2)
• Best-tag-only option: only the top-ranked tag for each word is output
  => gain in speed and robustness of subsequent processes (e.g. parsing)
• Brill's rule-based tagger, trained on Wall Street Journal texts, is used to preprocess the linguistic streams used by SMART

Syntactic Tagging (1)
• Capturing semantic dependencies is critical for accurate text indexing
• Need to exploit syntactic structures produced by a fairly comprehensive parser
• TREC experiment: TTP (Tagged Text Parser), based on Linguistic String Grammar
• Full grammar parser with a built-in timer regulating the amount of time allowed for parsing a sentence

Syntactic Tagging (2)
• If no parse is returned before the allotted time elapses, the parser switches to “skip-and-fit” mode
• Result: approximate parse
• Fragments skipped in the first pass are:
  – analyzed by a simple phrasal parser looking for noun phrases and relative clauses
  – attached to the main parse structure

Corpus-based disambiguation of long Noun Phrases (1)
• Relationships between words in complex phrases are required to decompose longer phrases into meaningful head+modifier pairs
• The pair extractor looks at the distribution statistics of compound terms
  – to decide whether the association between any two words in a noun phrase is syntactically valid and semantically significant

Corpus-based disambiguation (2)
• Phrasal terms are extracted in two phases:
  1. Only unambiguous head-modifier pairs are generated
  2. Distributional statistics gathered in the first phase are used to predict the strength of alternative modifier-modified links within ambiguous phrases
• Example: multiple unambiguous occurrences of “insider trading”, only a few of “trading case”, and numerous phrases such as “insider trading case” and “insider trading legislation”
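
The two phases can be sketched for three-word noun phrases. The decision rule used here (pick the bracketing whose inner pair was seen more often in phase 1) is a deliberately simplified stand-in for the actual link-strength statistics, and the phrase counts are invented:

```python
from collections import Counter


def collect_pair_counts(two_word_phrases):
    """Phase 1: count pairs seen in unambiguous two-word noun phrases."""
    counts = Counter()
    for phrase in two_word_phrases:
        modifier, head = phrase.split()
        counts[(modifier, head)] += 1
    return counts


def bracket(phrase, counts):
    """Phase 2: choose a bracketing for an ambiguous three-word phrase 'a b c'."""
    a, b, c = phrase.split()
    if counts[(a, b)] >= counts[(b, c)]:
        return ((a, b), c)  # [[a b] c]: 'a b' is the stable unit
    return (a, (b, c))      # [a [b c]]: 'b c' is the stable unit


# Hypothetical phase-1 observations
unambiguous = ["insider trading"] * 5 + ["trading case"] + ["court case"] * 3
counts = collect_pair_counts(unambiguous)

left_branching = bracket("insider trading case", counts)
right_branching = bracket("federal court case", counts)
```

Because “insider trading” was seen far more often on its own than “trading case”, the first phrase is grouped as [[insider trading] case], whereas “federal court case” is grouped around the frequent pair “court case”.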

Language Resources
• Machine Readable Dictionaries (MRD): mixed results in experiments
• Knowledge bases
  – CYC: huge knowledge base of common-sense knowledge; untested contribution to IR
  – WordNet: models the lexical knowledge of a native user of English

Usage of WordNet in Information Retrieval Tasks (1)
• WordNet: organized around logical groupings of related terms (synsets)
• Synset: list of synonymous word forms and semantic pointers describing relationships between the current synset and other synsets
• Knowledge base: the nouns in WordNet
• Nouns: the most content-bearing of all word classes; they occur in every sentence

Usage of WordNet (2)
• WordNet is partitioned into Hierarchical Concept Graphs (HCG) based on the IS-A hierarchical links between synsets
• The information content of each synset is approximated by estimating the probability of occurrence of all nouns in all subordinate synsets
• Semantic similarity between two nouns (that is, between the synsets from which the nouns are drawn): information content of the first synset which subsumes the two synsets
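
A minimal sketch of this similarity measure on a hand-made IS-A tree. It assumes the common definition of information content as the negative logarithm of the synset's probability, which the slide does not spell out; the six "synsets" and their noun counts are invented and do not come from WordNet:

```python
import math

# Hypothetical IS-A hierarchy: child -> parent (None marks the root)
PARENT = {
    "entity": None,
    "animal": "entity",
    "artifact": "entity",
    "dog": "animal",
    "cat": "animal",
    "car": "artifact",
}

# Hypothetical corpus counts of nouns belonging directly to each synset
NOUN_COUNTS = {"entity": 0, "animal": 10, "artifact": 20, "dog": 40, "cat": 30, "car": 100}


def ancestors(synset):
    """The synset itself followed by its IS-A ancestors, nearest first."""
    chain = []
    while synset is not None:
        chain.append(synset)
        synset = PARENT[synset]
    return chain


def probability(synset):
    """Probability of a noun from this synset or any subordinate synset."""
    total = sum(NOUN_COUNTS.values())
    mass = sum(count for s, count in NOUN_COUNTS.items() if synset in ancestors(s))
    return mass / total


def information_content(synset):
    return -math.log(probability(synset))


def similarity(synset_a, synset_b):
    """Information content of the first synset that subsumes both arguments."""
    above_b = set(ancestors(synset_b))
    for candidate in ancestors(synset_a):
        if candidate in above_b:
            return information_content(candidate)
    raise ValueError("synsets do not share a root")


close_pair = similarity("dog", "cat")  # subsumed by "animal"
far_pair = similarity("dog", "car")    # subsumed only by the root "entity"
```

The root subsumes every noun, so its probability is 1 and its information content is 0. Two nouns that meet only at the root are therefore maximally dissimilar under this measure.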

Usage of WordNet (3)
• Simple word sense disambiguation process for documents, which chooses the single most likely sense of each noun occurrence
• Experiments: the top 1000 documents are prefetched from the collection using term weighting (a conventional IR technique), and an exhaustive word-distance-based measure is then applied to these documents

Usage of WordNet (4)
• Retrieval effectiveness using word distances (in terms of precision and recall): poor compared to the tf × idf term weighting strategy
• Possible sources of error: syntactic tagging of documents, word sense disambiguation, semantic matching between words

Other Roles for NLP (1)
• Routing (filtering): the amount of training data is the dominant factor in performance
• Text categorization (automatic assignment to prior headings): using complex terms had no extra beneficial effect
• Real contribution to selective content-based information management?

Other Roles for NLP (2)
• Displaying information about whole documents: giving selected phrases is more informative than highlighting matching terms or listing key individual words
  – Information extraction and summarizing
• Real role of NLP: supporting more exigent information-management functions within a larger, multi-functional whole