ICT 619 Intelligent Systems
Topic 9: Natural Language Processing and Language Technology
ICT 619 S2-05

What is natural language processing (NLP)?
§ An ideal goal for human-computer communication is the ability to communicate in a natural language
§ NLP grew as a sub-domain of AI and linguistics: the task of developing software capable of understanding information (commands, text) expressed in a natural language in order to achieve specific goals
§ Understanding natural languages is a challenging task for computers
  § This is due to ambiguities, frequent use of context, and the overall problem of acquiring and using knowledge

Speech (voice) recognition and natural language processing
§ Speech recognition concerns understanding spoken commands or sentences from voice inputs
  § Example: Telstra’s directory assistance
§ A speech recognition system must first extract and recognise words from audio input
§ We might also like the system to be able to answer in speech - this requires speech generation as well
§ In NLP, input is already available in machine-readable form (eg words as Unicode text)
§ Future improvements of speech recognition will to some extent depend on progress in NLP

Speech Recognition - The state-of-the-art
§ 60-90% accuracy - good enough for general dictation
§ Speaker dependent - needs training
§ Cheap desktop software available
  § Example: IBM ViaVoice, Dragon NaturallySpeaking
§ Issues:
  § Isolated vs. continuous speech
  § Vocabulary size
  § Better speaker independence

Language Technology
§ Covers all areas related to NLP with a practical focus
§ Language technology is defined as: the application of knowledge about human language in computer-based solutions
§ Applications covered by language technology include:
  § Spoken language dialogue systems (speech recognition, some understanding, and speech generation)
  § Machine translation
  § Text summarisation
  § Information retrieval

Language Technology (cont’d)
§ The input to a language technology system may be provided through
  § speech recognition
  § optical character recognition (OCR)
  § handwriting recognition
§ The output may be in the form of speech, tailored documents, or web pages

Approaches to natural language processing
Main approaches
§ Keyword searching
§ Linguistic analysis
§ AI-based
§ ANN-based
§ Statistical analysis
Keyword searching systems
§ Early NLP systems - and some in use today - are based on keyword searching (pattern matching)

Keyword searching NLP systems
§ Selected keywords or phrases are searched for in the input sentence
§ The program responds with specific pre-stored responses based on the keywords or phrases
§ The program may actually construct a response based on a partial reply coupled with keywords and phrases from the input
§ No real understanding of the input is involved
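The keyword-and-template behaviour described above can be sketched in a few lines of Python. This is a minimal illustration, not the code of any historical system: the rule table, the reply templates and the function name `respond` are all hypothetical.

```python
import re

# Hypothetical keyword table: each pattern maps to a pre-stored reply template.
# "{0}" is filled with a fragment captured from the user's input.
RULES = [
    (re.compile(r"\bi need (.+)", re.IGNORECASE), "Why do you need {0}?"),
    (re.compile(r"\bmother\b", re.IGNORECASE), "Tell me more about your family."),
    (re.compile(r"\b(hello|hi)\b", re.IGNORECASE), "Hello. What would you like to talk about?"),
]
DEFAULT_REPLY = "Please go on."


def respond(sentence: str) -> str:
    """Return the reply for the first rule whose keyword pattern matches."""
    for pattern, template in RULES:
        match = pattern.search(sentence)
        if match:
            # Reuse part of the input, with trailing punctuation removed.
            fragments = [g.rstrip(".!?") for g in match.groups()]
            return template.format(*fragments)
    # Nothing in the look-up table matched: fall back to a stock phrase.
    return DEFAULT_REPLY


print(respond("I need a holiday."))
print(respond("The weather is nice."))
```

The second call shows the limitation: any input outside the look-up table gets the same stock reply, because the program models no knowledge at all.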

Keyword searching NLP systems (cont’d)
§ The best-known example is the ELIZA program, developed at MIT in the mid-1960s

Keyword systems
§ Limitations
  § Inflexible - really just reactive responses
  § Unable to cope with anything not in their keyword look-up tables
  § No knowledge modelling
§ Today’s more sophisticated NLP systems
  § Try to understand the content of language by doing syntactical, semantic and pragmatic analyses
  § May be able to do some conceptual modelling
  § Are better able to maintain continuous dialogues
  § Attempt to cope with the ambiguity and other features common in natural language

Other approaches to NLP
Linguistic analysis approach
§ Based on encoding formal grammar rules for sentence-level processing
§ A linguistically-oriented system focuses on syntax and semantics
AI-based systems
§ Focus on using world knowledge to understand language
§ One example of an AI-based NLP system is BORIS
  § written by Michael Dyer, a student of Roger Schank's
  § a story understanding program that reads a narrative and answers questions about it

AI-based NLP example - BORIS
Richard hadn’t heard from his college roommate Paul for years. Richard had borrowed money from Paul which was never paid back. But now he had no idea where to find his old friend. When a letter finally arrived from San Francisco, Richard was anxious to find out how Paul was.
Q: What happened to Richard at home?
BORIS: Richard got a letter from Paul.
Q: Who is Paul?
BORIS: Richard’s friend.
Q: Did Richard want to see Paul?
BORIS: Yes, Richard wanted to know how Paul was.
Q: Had Paul helped Richard?
BORIS: Yes, Paul lent money to Richard.
The BORIS system (from Roger Schank and Peter Childers, The Cognitive Computer).

Artificial neural networks based NLP
ANN-based systems
§ Use ANNs for processing language, particularly for lexical disambiguation
§ A neural net is trained to disambiguate by using context
§ Training presents units of six or so words containing the target word to be learned
§ Example: disambiguation of the word “bank” in “We got a bank loan to buy a house”
  § Two possible senses: money sense, river sense
  § Groups of co-occurring words (neighbourhoods):
    § Money sense: bank money loan branch fee robbery
    § River sense: bank river bridge erosion earth slope
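To make the role of neighbourhoods concrete, here is a small Python sketch that picks a sense for “bank” by counting how many context words fall into each neighbourhood. It is deliberately not a neural network: a trained net would learn such associations from examples, whereas this sketch hard-codes the two word groups from the slide. The function name `disambiguate` is hypothetical, and the whole sentence is used as context rather than a six-word window.

```python
# Neighbourhoods copied from the slide; a trained network would learn these.
NEIGHBOURHOODS = {
    "money": {"bank", "money", "loan", "branch", "fee", "robbery"},
    "river": {"bank", "river", "bridge", "erosion", "earth", "slope"},
}


def disambiguate(sentence, target="bank"):
    """Return the sense whose neighbourhood overlaps most with the context.

    Returns None when no context word matches or when the senses tie.
    """
    context = {w.strip(".,!?\"'").lower() for w in sentence.split()}
    context.discard(target)  # the target word itself carries no evidence
    scores = {sense: len(context & hood) for sense, hood in NEIGHBOURHOODS.items()}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (best_sense, best), (_, runner_up) = ranked[0], ranked[1]
    if best == 0 or best == runner_up:
        return None
    return best_sense


print(disambiguate("We got a bank loan to buy a house"))
```

For the slide's example sentence, the only matching context word is “loan”, which belongs to the money neighbourhood.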

Statistical approach to NLP
Statistical approach
§ Based on extracting statistically significant information (tags) from large corpora or bodies of text (millions of words) and using these as very general indexes to model parts or responses
§ Valuable because it does not require as much hand-modelling of knowledge, but acquires the tags automatically
§ Statistical methods are now receiving much attention, and more systems are likely to incorporate them in future
§ Most NLP systems use a combination of the linguistic and AI approaches

Components of NLP systems
§ Five major elements: the parser, the lexicon, the semantic analyser, the knowledge base, and the generator

Components of NLP systems (cont’d)
§ A syntactical parser analyses the input sentence using the language's grammar or rules of syntax
§ The output produced is a structural description of the sentence - known as a parse tree
§ Some rules of syntax for English:
  S = NP + VP
    S: sentence
    NP: noun phrase
    VP: predicate or verb phrase
§ The noun phrase can be more than a single noun:
  NP = D + ADJ + N
    D: determiner, eg “a”, “this”
    ADJ: adjective
    N: main noun

Components of NLP systems (cont.)
The lexicon
§ An internal dictionary used to perform the syntactic and semantic analysis
§ Contains semantic and grammatical information (eg, part-of-speech) about words or word strings
Fig. An example parse tree for the sentence “Mary had a little lamb”
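As a rough illustration of how a parser and a lexicon work together, the Python sketch below parses “Mary had a little lamb” with the rule S = NP + VP. Several details are assumptions made for the sketch and are not stated on the slides: the determiner and adjectives in NP are treated as optional, VP is taken to be a verb followed by an optional NP, and the five-word lexicon and all function names are hypothetical.

```python
# Hypothetical lexicon: word -> part of speech.
LEXICON = {
    "mary": "N", "lamb": "N",
    "had": "V",
    "a": "D", "this": "D",
    "little": "ADJ",
}


def tag_words(sentence):
    """Look each word up in the lexicon and return (tag, word) pairs."""
    tokens = []
    for word in sentence.lower().rstrip(".!?").split():
        if word not in LEXICON:
            raise ValueError(f"word not in lexicon: {word}")
        tokens.append((LEXICON[word], word))
    return tokens


def parse_np(tokens, i):
    """NP = [D] ADJ* N. Return (subtree, next index), or None on failure."""
    children = []
    if i < len(tokens) and tokens[i][0] == "D":
        children.append(tokens[i])
        i += 1
    while i < len(tokens) and tokens[i][0] == "ADJ":
        children.append(tokens[i])
        i += 1
    if i < len(tokens) and tokens[i][0] == "N":
        children.append(tokens[i])
        return ("NP", children), i + 1
    return None


def parse_vp(tokens, i):
    """VP = V [NP] (an assumption). Return (subtree, next index), or None."""
    if i >= len(tokens) or tokens[i][0] != "V":
        return None
    children = [tokens[i]]
    i += 1
    np = parse_np(tokens, i)
    if np is not None:
        subtree, i = np
        children.append(subtree)
    return ("VP", children), i


def parse_sentence(sentence):
    """S = NP + VP. Return the parse tree as nested tuples."""
    tokens = tag_words(sentence)
    np = parse_np(tokens, 0)
    if np is None:
        raise ValueError("expected a noun phrase")
    np_tree, i = np
    vp = parse_vp(tokens, i)
    if vp is None:
        raise ValueError("expected a verb phrase")
    vp_tree, i = vp
    if i != len(tokens):
        raise ValueError("unparsed words remain")
    return ("S", [np_tree, vp_tree])


print(parse_sentence("Mary had a little lamb"))
```

The nested tuples play the role of the parse tree: the subject NP holds “mary”, and the VP holds the verb “had” plus an object NP covering “a little lamb”.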

The semantic analyser and the knowledge base
§ The semantic analyser uses the parse tree and the knowledge base (KB) to try to determine what the sentence means
§ It creates another data structure that represents the meaning of the input sentences
§ It can also draw inferences from input statements using general knowledge in the KB
§ The semantic analyser's data structure and those in the KB should be in a common knowledge representation, such as KQML or Conceptual Graphs

The Generator
§ The generator uses the KB data structure created by the semantic analyser to create a usable output
§ The response depends in part on the pragmatics of the input language, eg greetings require greetings, questions require answers, commands require actions
§ The data structure can be used to initiate some action
  § eg when the language system is a front-end to a DBMS, the generator writes commands in a query language to begin a search
§ Simple generators feed standard pre-stored output responses to the user based on the built meaning representation
§ More sophisticated generators construct an original response by instantiating templates based on models of language use
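The DBMS front-end case can be sketched as follows: a generator takes a meaning representation and writes a query-language command. Everything specific here is an assumption for illustration, including the dictionary layout of the meaning representation, the table and column names, and the choice of SQL with `?` placeholders as the target query language.

```python
def generate_query(meaning):
    """Turn a hypothetical meaning representation into an SQL string.

    `meaning` is a dict with keys 'action' and 'table', plus optional
    'filters' (column -> value) and 'order_by' (column name).
    Returns (sql, params) so that values are passed separately from the text.
    """
    if meaning["action"] != "list":
        raise ValueError("this sketch only handles 'list' requests")
    sql = f"SELECT * FROM {meaning['table']}"
    params = []
    filters = meaning.get("filters", {})
    if filters:
        clauses = [f"{column} = ?" for column in filters]
        sql += " WHERE " + " AND ".join(clauses)
        params = list(filters.values())
    if "order_by" in meaning:
        sql += f" ORDER BY {meaning['order_by']}"
    return sql, params


# Meaning representation a semantic analyser might build for a request
# such as "Show the 1992 sales for the Central region".
request = {
    "action": "list",
    "table": "sales",
    "filters": {"region": "Central", "year": 1992},
}
print(generate_query(request))
```

Keeping the values in a separate parameter list, rather than pasting them into the query text, is a design choice of this sketch that avoids quoting problems when the values come from user input.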

Applications of NLP - Natural language interfaces (NLI)
§ An NLP system can be the front-end of information systems to provide a more user-friendly interface
  § Eg, the command “List details of all files in this folder sorted by time of creation” is much friendlier than “ls -atl”, especially using voice input
§ An NLI processes sentences exchanged between a user and an application
§ The task is made easier by a restricted domain of discourse, eg a database holding information on a single area of application
§ Interfaces for expert systems, operating systems and document retrieval systems are also being developed

Homer: A Language-using Agent
Source: Vere, S. & Bickmore, T., A Basic Agent, Computational Intelligence, 199

Natural Language - Homer
TIM> Drop the package at the barge next Saturday at 9 pm.
HOMER> OK.
TIM> Are you going to be at the pier next Saturday?
HOMER> Yes.
TIM> Why are you going to the pier?
HOMER> I’ll go to the pier in order to pick up the package.
TIM> What time are you going to pick it up?
HOMER> 8:56 pm.
STEVE> Where will the package be next Saturday?
HOMER> At the barge.

STEVE> What is in front of you?
HOMER> A log.
STEVE> Do you own the log?
HOMER> No I don’t.
STEVE> The log belongs to you.
HOMER> Oh.
STEVE> Cows eat grass.
HOMER> I know.
STEVE> Do you own the log now?
HOMER> Yes I do.

Examples of commercial NL interfaces: Intellect (Trinzic Corp.)
§ One of the most widely used natural language front-end interfaces available for mainframes
§ Designed for use with DBMS under IBM operating system environments
§ In addition to allowing access to data in a database, Intellect allows creation of databases using natural language
§ The built-in lexicon may be modified to fit a particular application

Q&A (Symantec Corp.)
§ A basic file manager with a natural language front-end called “The Intelligent Assistant”
§ Parses common English input questions and converts them into queries that the file manager can understand
§ Paraphrases input requests to ensure full understanding of what the user wants
  § Eg, user input: Show the total 1992 sales for the Central Region
  § Q&A Intelligent Assistant’s response:
    Shall I do the following?
    Create a report showing the amount of sales for the central region in 1992?
    Y(es) - Continue
    N(o) - Cancel request
§ Symantec discontinued Q&A and then sold it to a German company called CAB GmbH

Machine translation
Goal:
§ To support translation of some language into a language other than the original
Applications include:
§ Desktop and web-based translation services
§ Spoken language translation services (eg phone-based)
Requirements:
§ Understanding the meaning of input sentences
  § This would involve a semantic analysis of the input using semantic knowledge
§ An automatic translation system is expected to be robust and not stop whenever it encounters an item it cannot understand

Machine translation (cont’d)
Current approaches use a transfer grammar
§ Input text → Partial analysis → 1st intermediate representation of content (related to the source language)
§ 1st intermediate representation → Transformation using a transfer grammar → 2nd intermediate representation (related to the target language)
§ 2nd intermediate representation → NL generator → Text in target language
§ Machine translation as performed since the mid-1960s is not true “understanding” of text
§ By 1991, systems that could process sentences with limited vocabulary started appearing
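The three stages of a transfer-based system can be sketched on a toy scale in Python. This is only an illustration under heavy assumptions: the English-to-French word list, the single reordering rule (adjective before noun becomes noun before adjective) and all function names are hypothetical, and the word list uses only feminine French nouns so that article and adjective agreement can be ignored.

```python
# Hypothetical source-language tags and a toy English -> French word list.
SOURCE_TAGS = {"the": "D", "red": "ADJ", "green": "ADJ", "car": "N", "house": "N"}
TRANSFER_LEXICON = {
    "the": "la", "red": "rouge", "green": "verte",
    "car": "voiture", "house": "maison",
}


def analyse(text):
    """Partial analysis: text -> 1st representation, a list of (tag, source word)."""
    return [(SOURCE_TAGS.get(w, "UNK"), w) for w in text.lower().split()]


def transfer(source_repr):
    """Transfer step: reorder ADJ N to N ADJ, then swap in target-language words."""
    items = list(source_repr)
    i = 0
    while i < len(items) - 1:
        if items[i][0] == "ADJ" and items[i + 1][0] == "N":
            items[i], items[i + 1] = items[i + 1], items[i]
            i += 2
        else:
            i += 1
    # Unknown words are passed through unchanged rather than stopping the system.
    return [(tag, TRANSFER_LEXICON.get(word, word)) for tag, word in items]


def generate(target_repr):
    """Generation: 2nd representation -> text in the target language."""
    return " ".join(word for _, word in target_repr)


def translate(text):
    return generate(transfer(analyse(text)))


print(translate("the red car"))
```

Nothing in this pipeline represents what the phrase means; it only maps structures and words, which is the sense in which such translation is not true understanding.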

Current state-of-the-art of machine translation
§ Broad coverage MT systems are already available on the Web with fast turnaround time and acceptable error rate
§ Higher accuracy is achieved by domain-specific systems
  § For example, the controlled language used in Caterpillar manuals
Machine translation products
§ Bowne Global Solutions’ iTranslator
  § www.itranslator.com
§ Systran’s Babel Fish (used by AltaVista)
  § www.systransoft.com

Current state-of-the-art of machine translation (cont’d)
An example: Systran’s Web-based Translator

Spoken language dialogue systems
§ Communicate with users via automatic speech recognition and text-to-speech interfaces
§ Mediate the user’s access to a back-end database
Examples:
§ Information services: stock quotes, timetables
§ Transaction services: banking, betting, flight reservations
§ Current technology has been claimed to be capable of reducing call centre costs from $75 to 18c a call
Some issues:
§ Telephony-based systems cannot afford a training period
§ Making a conversation too realistic falsely raises user expectations and can confuse the system

Spoken language dialogue systems (cont’d)
More issues:
§ Error handling is a significant issue
§ Giving initiative to the user increases difficulty
Some relatively successful examples:
§ A Sydney taxi booking service (about 30% of cases have to go to human operators)
§ Telstra directory assistance service (15-20% accuracy, but 15-20% of automation may be useful enough)
§ Vendors of fielded spoken language dialogue applications:
  § Nuance (www.nuance.com)
  § ScanSoft/SpeechWorks (www.scansoft.com)
  § Philips (www.speech.philips.com)

Text processing
§ A number of different applications dealing with the processing of continuous text may be grouped together under this heading
§ Editing tools
  § Most common example: spelling and syntax (or grammar) checkers
  § Characterised by avoidance of deep semantic processing
§ Content extraction
  § Concerns extraction of specific information from texts
  § Examples: extraction of information related to a financial transaction from a bank telex, or of bibliographic information from research papers

Text processing (cont’d)
§ Content extraction (cont’d)
  § Requires deep semantic analysis, which is aided by the restricted domain and a priori knowledge of the information to be extracted
  § Commercial systems exist for electronic mail processing, banking systems and automatic summary generation
  § Examples:
    § ATRANS from Cognitive Systems
    § DEAL-READER from Gecosys

Text processing (cont.)
Text summarisation
Objective:
§ To produce a version of a document shorter than the original document
§ Applications of text summarisation are found in
  § Information browsing
  § Voice delivery of Web pages and email
§ Issues concerning text summarisation
  § Different kinds of summaries: indicative (what is it about?) vs informative (what is there of interest to the user?)
  § Real summarisation requires real understanding

Text summarisation state-of-the-art
§ Commercial systems work on a ‘sentence-extraction’ model
  § Sentences regarded as ‘important’ are extracted and put together
  § Importance of sentences is decided on the basis of location, inclusion of key words, and statistical information such as frequency
§ Current systems are relatively knowledge-free
  § Not based on real understanding of the text
§ Some text summarisation applications currently available:
  § CognIT’s CORPORUM (www.cognit.com)
  § Inxight’s Summarizer (www.inxight.com)
  § MS Word’s summarisation tool
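A sentence-extraction summariser of the kind described above can be sketched in Python. The sketch uses word frequency alone as the importance signal (sentence location and cue words are left out), and the stop-word list, the scoring formula and the name `summarise` are assumptions made for illustration rather than the method of any product listed here.

```python
import re
from collections import Counter

# Hypothetical stop-word list: very common words carry no importance signal.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are", "it", "on", "for"}


def content_words(sentence):
    """Lower-case words of the sentence, minus stop words."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOP_WORDS]


def summarise(text, max_sentences=1):
    """Extract the highest-scoring sentences, keeping their original order."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    freq = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        words = content_words(sentence)
        # Average document-wide frequency of the sentence's content words.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in chosen)


text = ("Parsing builds a parse tree. The weather was mild. "
        "A parse tree describes sentence structure.")
print(summarise(text, max_sentences=2))
```

The off-topic middle sentence is dropped purely because its words are rare in the document. No understanding of the text is involved, which matches the ‘knowledge-free’ character noted on the slide.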

Search and Information Retrieval
§ Ever increasing amount of information available worldwide, particularly on the Internet
§ Searching for and retrieving information relevant to a topic of interest is an active area of research and application
§ Document retrieval (DR)
  § Also known as text retrieval
  § Involves retrieving text ranging from paragraph to book length for humans to read
§ DR may involve
  § searching well-maintained bibliographic databases
  § scanning hard disks for missing files
  § searching thousands of Web servers for natural language articles on a topic of interest

Search and Information Retrieval (cont’d)
§ Efficacy of a DR system is measured by
  § Precision - the proportion of retrieved documents that are relevant, and
  § Recall - the proportion of relevant documents that are retrieved
§ Retrieval depends on indexing - indicating what documents are about
§ Indexing requires an indexing language, a term vocabulary, and a method for constructing requests and document descriptions
§ Both controlled language indexing and the more sophisticated natural language indexing require NLP capabilities
§ Compact descriptions of a document’s significance may increase the efficiency of matching
§ Increasing both recall and precision is the fundamental goal of index languages
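The two measures can be computed directly from the set of retrieved documents and the set of relevant documents. The short Python sketch below does this for made-up document identifiers; the function name `precision_recall` and the data are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # documents both retrieved and relevant
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Four documents were retrieved; five documents are actually relevant.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d5", "d6", "d7"]
print(precision_recall(retrieved, relevant))  # → (0.5, 0.4)
```

Two of the four retrieved documents are relevant (precision 2/4), and two of the five relevant documents were found (recall 2/5). Retrieving more documents tends to raise recall at the cost of precision, which is why improving both at once is the hard goal.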

Search and Information Retrieval (cont’d)
Current topics of interest in search and information retrieval include:
§ Concept-based search: documents are characterised by relevant concepts and not just key words
  § For example, a search for ‘car’ should also retrieve documents on ‘automobiles’
§ Named entity recognition: recognising names of people, places, organisations etc.
  § One person or organisation can be referred to by many name variants - eg, John Howard, Mr. Howard, J. W. Howard, the PM
  § Many persons or organisations can share the same name - eg, politician John Howard, actor John Howard

Search and Information Retrieval (cont’d)
Search and Information Retrieval State-of-the-art
§ Current trend (eg Google) is to expand the search vocabulary by using thesauri (eg, ‘car’ → ‘automobile’)
§ Linguistic analysis to identify phrases relevant to the initial query
  § Key phrases can be more useful than just key words
  § Can be used to expand an initial user query (Khan & Khor 2004)
§ Some current search and information retrieval applications:
  § Ultra Find: www.ultradesign.com/untrafind/ultrafind.html
  § Lotus Discovery Server: www.lotus.com/products/discserver.nsf
§ Smart text processing suites:
  § Inxight: www.inxight.com
  § Verity: wwwl.verity.com
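Thesaurus-based query expansion can be illustrated with a small Python sketch. It is a simplified stand-in and not the method of Khan & Khor (2004) or of any search engine: the two-entry thesaurus, the three documents and the function names are all hypothetical.

```python
# Hypothetical thesaurus: term -> related terms to add to the query.
THESAURUS = {"car": ["automobile"], "film": ["movie"]}


def expand_query(query):
    """Return the query terms plus their thesaurus entries, without duplicates."""
    expanded = []
    for term in query.lower().split():
        for candidate in [term] + THESAURUS.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


def search(query, documents):
    """Return ids of documents containing any term of the expanded query."""
    terms = expand_query(query)
    return [doc_id for doc_id, text in documents.items()
            if any(term in text.lower().split() for term in terms)]


documents = {
    "d1": "Used automobile prices rise",
    "d2": "Car hire in Perth",
    "d3": "Train timetable",
}
print(expand_query("car"))
print(search("car", documents))
```

Without expansion, a search for ‘car’ would match only `d2`. With the thesaurus entry, `d1` is retrieved as well, which raises recall.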

Challenges faced by NLP
§ A good NLP system must be capable of handling common linguistic problems caused by ambiguities and the use of context
Prepositional phrase attachment
§ A sentence can often be analysed in more than one way, producing multiple parse trees for the sentence
§ Example sentence: “John saw the boy in the park with a telescope” has 3 possible parses
  § Without contextual knowledge, it is not known whether John was looking through the telescope, the boy had a telescope, or the park had a telescope in it

Challenges faced by NLP (cont’d)
Lexical ambiguity
§ When words have multiple meanings
§ A classic example:
  § Time flies like an arrow.
  § Fruit flies like a banana.
§ In the first case, “flies” is a verb and “like” is a preposition
§ In the second case, “flies” is a noun and “like” is a verb

Challenges faced by NLP (cont.)
Anaphoric reference or pronoun resolution
§ Problem of figuring out what a pronoun refers to
§ Examples:
  (1) Give me the names of all managers and how much they earn.
  (2) Mary went to see Jane. She was happy to see her.
§ In (1), it is easy to decide that “they” refers to the managers already mentioned
§ In (2), it is difficult to decide who “she” and “her” refer to - was Mary happy to see Jane, or was Jane happy to see Mary?

Challenges faced by NLP (cont.)
Ellipsis
§ Sentences appearing to have parts missing
§ Example: John works in Personnel, Mary in Accounting.
  § “Mary in Accounting” lacks a verb but is understandable using the context of the entire sentence
  § “Mary in Accounting” is an elliptical form of “Mary works in Accounting”

Challenges faced by NLP (cont.)
Quantifier scope
§ Quantifiers such as “all”, “every”, “some”, and “no” can be ambiguous
§ Example: Every employee does not like Mr Smith
  § Meaning: not a single employee likes Mr Smith, or some do and some don’t
§ No current NLP system can handle all of these problems - there is no unrestricted NLP system yet
§ Yet some, such as HOMER, can handle the most common forms
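The two readings of “Every employee does not like Mr Smith” can be written in first-order logic, which makes the difference in scope explicit. The predicate names Employee and Likes are chosen here for illustration only.

```latex
% Reading 1: the negation is inside the scope of "every"
% (not a single employee likes Mr Smith)
\forall x\, \bigl(\mathrm{Employee}(x) \rightarrow \neg\, \mathrm{Likes}(x, \mathrm{Smith})\bigr)

% Reading 2: the negation is outside the scope of "every"
% (it is not the case that every employee likes Mr Smith)
\neg\, \forall x\, \bigl(\mathrm{Employee}(x) \rightarrow \mathrm{Likes}(x, \mathrm{Smith})\bigr)
```

Reading 2 is equivalent to saying that at least one employee does not like Mr Smith. The informal gloss “some do and some don’t” is slightly stronger, since it also asserts that at least one employee does.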

REFERENCES
§ Germain, E., Introducing Natural Language Processing, AI Expert, August 1992, pp. 30-35.
§ Lewis, D. D., and Jones, K. S., Natural Language Processing for Information Retrieval, Communications of the ACM, Vol. 39, No. 1 (January 1996), pp. 92-100.
§ Turban, E., Decision Support and Expert Systems, Prentice Hall, Englewood Cliffs, New Jersey, 1995, pp. 242-257.
§ Thayse, A. (Editor), From Natural Language Processing to Logic for Expert Systems, John Wiley & Sons, 1991.
§ Cole, R., Zaenen A., & Zampolli (eds), Survey of the State of the Art in Human Language Technology, Cambridge University Press, 1998.
  § Available on the web: http://cslu.cse.ogi.edu/HLTsurvey/
§ Dale, R., Language Technology: Applications and Techniques, Tutorial 2004, The 8th Pacific Rim Int. Conf. on Artificial Intelligence, Auckland, 9-13 August, 2004.
§ Khan, M. S., and Khor, S., “Automatic Query Expansion for Enhanced Web Document Retrieval”, Journal of the American Society for Information Science and Technology, Vol. 55, No. 1, 2004, pp. 29-40.