
Artificial intelligence & natural language processing
Mark Sanderson
Porto, 2000

Aims
• To provide an outline of the attempts made at using NLP techniques in IR

Objectives
• At the end of this lecture you will be able to
  – Outline a range of attempts to get NLP to work with IR systems
  – Idly speculate on why they failed
  – Describe the successful use of NLP in a limited domain

Why?
• Seems an obvious area of investigation
  – Why is it not working?

Use of NLP
• Syntactic
  – Parsing to identify phrases
  – Full syntactic structure comparison
• Semantic
  – Building an understanding of a document’s content
• Discourse
  – Exploiting document structure?

Syntactic
• Parsing to identify phrases
  – The issues.
  – Explain how it’s done (a bit).
  – Is it worth it?
• Other possibilities
  – Grammatical tagging
  – Full syntactic structure comparison
    • Explain how it’s done (a little bit).
    • Show results.

Simple phrase identification
• High frequency terms could be good candidates.
  – Why?
• Terms co-occurring more often than chance.
  – Within small number of words.
  – Surrounding simple terms.
  – Not surrounding punctuation.
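The co-occurrence idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method of any system cited in the lecture: the function name `phrase_candidates`, the `window` and `min_count` parameters, and the observed/expected ratio used as a crude "more often than chance" test are all hypothetical choices.

```python
import re
from collections import Counter


def phrase_candidates(text, window=3, min_count=2):
    """Score ordered word pairs that co-occur more often than chance.

    A pair is counted when both words fall within a span of `window`
    consecutive words and no punctuation mark lies between them.
    """
    # Split on punctuation so that pairs never span a punctuation mark.
    segments = re.split(r"[.,;:!?()]", text.lower())
    word_counts = Counter()
    pair_counts = Counter()
    for segment in segments:
        words = segment.split()
        word_counts.update(words)
        for i, first in enumerate(words):
            for second in words[i + 1 : i + window]:
                pair_counts[(first, second)] += 1

    total = sum(word_counts.values())
    scored = {}
    for (first, second), observed in pair_counts.items():
        if observed < min_count:
            continue
        # Crude estimate of the count expected if the words were independent.
        expected = word_counts[first] * word_counts[second] / total
        scored[(first, second)] = observed / expected
    return scored
```

A ratio well above 1 marks a pair as a phrase candidate. Note that nothing here stops frequent function words from pairing up, which is the kind of problem the next slide raises.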

Problems
• Close words that aren’t phrases.
  – “the use of computers in science & technology”
• Distant words that are phrases.
  – “preparation & evaluation of abstracts and extracts”

Parsing for phrases
• Using parsers to identify noun phrases.
• Make a phrase out of a head and the head of its modifiers.

Parse of “automatic analysis of scientific text”:
[NP [ADJ automatic] [NOUN analysis] [PP [PREP of] [ADJ scientific] [NOUN text]]]
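The head-and-modifier rule can be illustrated with a small Python sketch. It assumes the parse is already available as a tree in which each constituent records its head word and its modifying constituents. The `Node` class and `head_modifier_pairs` function are hypothetical names, and treating the noun "text" as the head of the prepositional phrase (dropping the preposition) is an assumption made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    head: str                                      # head word of this constituent
    modifiers: list = field(default_factory=list)  # modifying constituents


def head_modifier_pairs(node):
    """Pair each constituent's head with the head of each of its modifiers."""
    pairs = []
    for modifier in node.modifiers:
        pairs.append((modifier.head, node.head))
        pairs.extend(head_modifier_pairs(modifier))
    return pairs


# Hand-built parse of "automatic analysis of scientific text".
text_np = Node("text", [Node("scientific")])
analysis_np = Node("analysis", [Node("automatic"), text_np])

for modifier, head in head_modifier_pairs(analysis_np):
    print(modifier, head)
```

For this parse the rule yields three candidate phrases: "automatic analysis", "text analysis" and "scientific text".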

Errors
• Not a perfect rule by any means.
  – Need restrictions to eliminate bogus phrases.

Parse of “automatic analysis of these four scientific texts”:
[NP [ADJ automatic] [NOUN analysis] [PP [PREP of] [DET these] [QUANT four] [ADJ scientific] [NOUN texts]]]

Do they work?
• Fagan compared statistical with syntactic; statistics won, but only just
  – J. Fagan (1987) Experiments in phrase indexing for document retrieval: a comparison of syntactic & nonsyntactic methods, in TR 87-868, Department of Computer Science, Cornell University
• More research has been conducted.
  – T. Strzalkowski (1995) Natural language information retrieval, in Information Processing & Management, Vol. 31, No. 3, pp 397-417

Check out TREC
• Overview of the Seventh Text REtrieval Conference (TREC-7), E. Voorhees, D. Harman (National Institute of Standards and Technology)
  – http://trec.nist.gov/
  – Ad hoc track
    • Fairly even between statistical phrases, syntactic phrases and no phrases.

Grammatical tagging?
• Tag document text with grammatical codes?
  – R. Garside (1987) The CLAWS word tagging system, in The computational analysis of English: a corpus based approach, R. Garside, G. Leech, G. Sampson Eds., Longman: 30-41.
• Doesn’t appear to work
  – R. Sacks-Davis, P. Wallis, R. Wilkinson (1990) Using syntactic analysis in a document retrieval system that uses signature files, in Proceedings of 13th ACM SIGIR Conference: 179-191.

Syntactic structure comparison
• Has been tried…
  – A. F. Smeaton & P. Sheridan (1991) Using morphosyntactic language analysis in phrase matching, in Proceedings of RIAO ’91, Pages 414-429
• Method
  – Parse sentences into tree structures
  – When you get a phrase match
    • Look at linking syntactic operator.
    • Look at the residual tree structure that didn’t match
• Does not appear to work

Semantic
• Disambiguation
  – Given a word appearing in a certain context, disambiguators will tell you what sense it is.
• IR system
  – Index document collections by senses rather than words
  – Ask the users what senses the query words are
  – Retrieve on senses
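A toy Python sketch of indexing by senses rather than words follows. Everything in it is an assumption made for illustration: the `SENSES` inventory, the sense labels, and the cue-word overlap heuristic in `disambiguate` stand in for a real disambiguator and are not taken from any system discussed in the lecture.

```python
from collections import defaultdict

# Hypothetical sense inventory: each sense lists context words that signal it.
SENSES = {
    "bank": {
        "bank/finance": {"money", "loan", "account"},
        "bank/river": {"river", "water", "fishing"},
    },
}


def disambiguate(word, context):
    """Pick the sense whose cue words overlap most with the context.

    Words without an entry in SENSES are returned unchanged.
    """
    senses = SENSES.get(word)
    if not senses:
        return word
    return max(senses, key=lambda sense: len(senses[sense] & context))


def build_sense_index(docs):
    """Map each sense (or plain word) to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        words = text.lower().split()
        context = set(words)
        for word in words:
            index[disambiguate(word, context)].add(doc_id)
    return index


docs = {1: "the bank approved the loan", 2: "fishing from the river bank"}
index = build_sense_index(docs)
print(sorted(index["bank/river"]))
```

A user who says the query word "bank" means the river sense retrieves only document 2, where a plain keyword index would return both documents.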

Disambiguation
• Does it work?
  – No (well maybe)
    • M. Sanderson, Word sense disambiguation and information retrieval, in Proceedings of the 17th ACM SIGIR Conference, Pages 142-151, 1994
    • M. Sanderson & C. J. van Rijsbergen, The impact on retrieval effectiveness of skewed frequency distributions, in ACM Transactions on Information Systems (TOIS) Vol. 17 No. 4, 1999, Pages 440-465.

Partial conclusions
• NLP has yet to prove itself in IR
  – Agree
    • D. D. Lewis & K. Sparck Jones (1996) Natural language processing for information retrieval, in Communications of the ACM (CACM), Vol. 39, No. 1, 92-101
  – Sort of don’t agree
    • A. Smeaton (1992) Progress in the application of natural language processing to information retrieval tasks, in The Computer Journal, Vol. 35, No. 3.

Mark’s idle speculation
• What people think is going on always
[Figure: diagram labelled “Keywords” and “NLP”]

Mark’s idle speculation
• What’s usually actually going on
[Figure: diagram labelled “Keywords” and “NLP”]

Areas where NLP does work
• Systems with the following ingredients.
  – Collection documents cover small domain.
  – Language use is limited in some manner.
  – User queries cover tight subject area.
  – Documents/queries very short
    • Image captions
  – LSI, pseudo-relevance feedback
  – People willing to spend money getting NLP to work

RIME & IOTA
• From Grenoble
  – Y. Chiaramella & J. Nie (1990) A retrieval model based on an extended modal logic and its application to the RIME experimental approach, in Proceedings of the 13th SIGIR conference, Pages 25-43
• Medical record retrieval system
• Some database’y parts
• Free text descriptions of cases

Indexing
• “an opacity affecting probably the lung and the trachea”

SGN - observed sign
LOC - localisation

{[p], SGN}
  {[and], SGN}
    {[bears-on], SGN}
      {[opacity], SGN}
      {[lung], LOC}
    {[bears-on], SGN}
      {[opacity], SGN}
      {[trachea], LOC}

Retrieval
• How do we match a user’s query to these structures?
  – Using transformations, a bit like logic.
  – t - uncertainty

[Figure: tree transformations: {[bears-on], SGN} {[opacity], SGN} ⇒ ⇒ {[lung], LOC}, t {[opacity], SGN}, t]

Tree transformation

{[has-for-value], SGN}
  {[bears-on], SGN}
    {[opacity], SGN}
    {[lung], LOC}
  {[has-for-value], SGN}
    {[contour], SGN}
    {[blurred], LOC}

⇒

{[has-for-value], SGN}, t
  {[opacity], SGN}
  {[has-for-value], SGN}
    {[contour], SGN}
    {[blurred], LOC}

Term transforms
• Basic medical terms stored in a hierarchy.
  – Transformations possible again with uncertainty added.

Level 1: tumour
Level 2: cancer, hygroma, kyste, pseudokyst, polyp
Level 3: sarcoma, polykystosis, polyposis
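One way to picture a term transformation with added uncertainty is the Python sketch below. It is an assumption-laden illustration, not the RIME model itself. The parent links for the Level 3 terms (sarcoma under cancer, polykystosis under kyste, polyposis under polyp) are guessed from the slide layout, and the per-step factor `STEP_CERTAINTY` and the function `match_certainty` are hypothetical.

```python
# Assumed parent links; the slide only gives the three levels.
PARENT = {
    "cancer": "tumour",
    "hygroma": "tumour",
    "kyste": "tumour",
    "pseudokyst": "tumour",
    "polyp": "tumour",
    "sarcoma": "cancer",
    "polykystosis": "kyste",
    "polyposis": "polyp",
}

STEP_CERTAINTY = 0.8  # hypothetical certainty retained per generalisation step


def match_certainty(doc_term, query_term):
    """Certainty that a document term satisfies a query term.

    The document term is generalised up the hierarchy one level at a time,
    and each step multiplies the certainty by STEP_CERTAINTY.
    Returns 0.0 when the query term is not an ancestor of the document term.
    """
    certainty = 1.0
    term = doc_term
    while term is not None:
        if term == query_term:
            return certainty
        term = PARENT.get(term)
        certainty *= STEP_CERTAINTY
    return 0.0
```

With these assumed values an exact match keeps certainty 1.0, a document mentioning "sarcoma" matches a query for "cancer" with 0.8 and a query for "tumour" with roughly 0.64, and unrelated terms do not match at all.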

Isn’t this a bit slow?
• Yes
• Optimisation
  – Scan for potential documents.
  – Process them intensively.
• Evaluation?
  – Not in that paper.
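The scan-then-process optimisation can be sketched as a two-stage retrieval function in Python. This is a generic illustration, not the RIME implementation: `two_stage_retrieve` is a hypothetical name, the first stage is assumed to be a plain keyword overlap test, and `expensive_score` stands in for whatever intensive structural matching the system performs.

```python
def two_stage_retrieve(query_terms, docs, expensive_score):
    """Cheaply scan for candidate documents, then score only those."""
    query = set(query_terms)

    # Stage 1: cheap scan, keep documents sharing at least one query term.
    candidates = [
        doc_id
        for doc_id, text in docs.items()
        if query & set(text.lower().split())
    ]

    # Stage 2: intensive processing, applied to the candidates only.
    scored = [(expensive_score(docs[doc_id]), doc_id) for doc_id in candidates]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]
```

The saving comes from the second stage never seeing documents that fail the cheap scan, so the cost of the intensive step grows with the number of candidates rather than with the size of the collection.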

Not unique
• SCISOR
  – P. S. Jacobs & L. F. Rau (1990) SCISOR: Extracting Information from On-line News, in Communications of the ACM (CACM), Vol. 33, No. 11, 88-97

Why do they work?
• Because of the restrictions
  – Small subject domain.
  – Limited vocabulary.
  – Restricted type of question.
• Compare with large scale IR system.
  – Keywords are good enough.
  – Long time to set up.
  – Hard to adapt to new domain.

Anything else for NLP?
• Text Generation
  – IR system explaining itself?

Conclusions
• By now, you will be able to
  – Outline a range of attempts to get NLP to work with IR systems
  – Idly speculate on why they failed
  – Describe the successful use of NLP in a limited domain